Internal change PiperOrigin-RevId: 271275031 Change-Id: I69bce2b27644a3fff7bc445c567c8fab4a8ff234
diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..baf0444 --- /dev/null +++ b/LICENSE
@@ -0,0 +1,459 @@ + GNU LESSER GENERAL PUBLIC LICENSE + Version 2.1, February 1999 + + Copyright (C) 1991, 1999 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + +[This is the first released version of the Lesser GPL. It also counts + as the successor of the GNU Library Public License, version 2, hence + the version number 2.1.] + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +Licenses are intended to guarantee your freedom to share and change +free software--to make sure the software is free for all its users. + + This license, the Lesser General Public License, applies to some +specially designated software packages--typically libraries--of the +Free Software Foundation and other authors who decide to use it. You +can use it too, but we suggest you first think carefully about whether +this license or the ordinary General Public License is the better +strategy to use in any particular case, based on the explanations below. + + When we speak of free software, we are referring to freedom of use, +not price. Our General Public Licenses are designed to make sure that +you have the freedom to distribute copies of free software (and charge +for this service if you wish); that you receive source code or can get +it if you want it; that you can change the software and use pieces of +it in new free programs; and that you are informed that you can do +these things. + + To protect your rights, we need to make restrictions that forbid +distributors to deny you these rights or to ask you to surrender these +rights. These restrictions translate to certain responsibilities for +you if you distribute copies of the library or if you modify it. + + For example, if you distribute copies of the library, whether gratis +or for a fee, you must give the recipients all the rights that we gave +you. You must make sure that they, too, receive or can get the source +code. If you link other code with the library, you must provide +complete object files to the recipients, so that they can relink them +with the library after making changes to the library and recompiling +it. And you must show them these terms so they know their rights. + + We protect your rights with a two-step method: (1) we copyright the +library, and (2) we offer you this license, which gives you legal +permission to copy, distribute and/or modify the library. + + To protect each distributor, we want to make it very clear that +there is no warranty for the free library. Also, if the library is +modified by someone else and passed on, the recipients should know +that what they have is not the original version, so that the original +author's reputation will not be affected by problems that might be +introduced by others. + + Finally, software patents pose a constant threat to the existence of +any free program. We wish to make sure that a company cannot +effectively restrict the users of a free program by obtaining a +restrictive license from a patent holder. Therefore, we insist that +any patent license obtained for a version of the library must be +consistent with the full freedom of use specified in this license. + + Most GNU software, including some libraries, is covered by the +ordinary GNU General Public License. 
This license, the GNU Lesser +General Public License, applies to certain designated libraries, and +is quite different from the ordinary General Public License. We use +this license for certain libraries in order to permit linking those +libraries into non-free programs. + + When a program is linked with a library, whether statically or using +a shared library, the combination of the two is legally speaking a +combined work, a derivative of the original library. The ordinary +General Public License therefore permits such linking only if the +entire combination fits its criteria of freedom. The Lesser General +Public License permits more lax criteria for linking other code with +the library. + + We call this license the "Lesser" General Public License because it +does Less to protect the user's freedom than the ordinary General +Public License. It also provides other free software developers Less +of an advantage over competing non-free programs. These disadvantages +are the reason we use the ordinary General Public License for many +libraries. However, the Lesser license provides advantages in certain +special circumstances. + + For example, on rare occasions, there may be a special need to +encourage the widest possible use of a certain library, so that it becomes +a de-facto standard. To achieve this, non-free programs must be +allowed to use the library. A more frequent case is that a free +library does the same job as widely used non-free libraries. In this +case, there is little to gain by limiting the free library to free +software only, so we use the Lesser General Public License. + + In other cases, permission to use a particular library in non-free +programs enables a greater number of people to use a large body of +free software. For example, permission to use the GNU C Library in +non-free programs enables many more people to use the whole GNU +operating system, as well as its variant, the GNU/Linux operating +system. + + Although the Lesser General Public License is Less protective of the +users' freedom, it does ensure that the user of a program that is +linked with the Library has the freedom and the wherewithal to run +that program using a modified version of the Library. + + The precise terms and conditions for copying, distribution and +modification follow. Pay close attention to the difference between a +"work based on the library" and a "work that uses the library". The +former contains code derived from the library, whereas the latter must +be combined with the library in order to run. + + GNU LESSER GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License Agreement applies to any software library or other +program which contains a notice placed by the copyright holder or +other authorized party saying it may be distributed under the terms of +this Lesser General Public License (also called "this License"). +Each licensee is addressed as "you". + + A "library" means a collection of software functions and/or data +prepared so as to be conveniently linked with application programs +(which use some of those functions and data) to form executables. + + The "Library", below, refers to any such software library or work +which has been distributed under these terms. 
A "work based on the +Library" means either the Library or any derivative work under +copyright law: that is to say, a work containing the Library or a +portion of it, either verbatim or with modifications and/or translated +straightforwardly into another language. (Hereinafter, translation is +included without limitation in the term "modification".) + + "Source code" for a work means the preferred form of the work for +making modifications to it. For a library, complete source code means +all the source code for all modules it contains, plus any associated +interface definition files, plus the scripts used to control compilation +and installation of the library. + + Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running a program using the Library is not restricted, and output from +such a program is covered only if its contents constitute a work based +on the Library (independent of the use of the Library in a tool for +writing it). Whether that is true depends on what the Library does +and what the program that uses the Library does. + + 1. You may copy and distribute verbatim copies of the Library's +complete source code as you receive it, in any medium, provided that +you conspicuously and appropriately publish on each copy an +appropriate copyright notice and disclaimer of warranty; keep intact +all the notices that refer to this License and to the absence of any +warranty; and distribute a copy of this License along with the +Library. + + You may charge a fee for the physical act of transferring a copy, +and you may at your option offer warranty protection in exchange for a +fee. + + 2. You may modify your copy or copies of the Library or any portion +of it, thus forming a work based on the Library, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) The modified work must itself be a software library. + + b) You must cause the files modified to carry prominent notices + stating that you changed the files and the date of any change. + + c) You must cause the whole of the work to be licensed at no + charge to all third parties under the terms of this License. + + d) If a facility in the modified Library refers to a function or a + table of data to be supplied by an application program that uses + the facility, other than as an argument passed when the facility + is invoked, then you must make a good faith effort to ensure that, + in the event an application does not supply such function or + table, the facility still operates, and performs whatever part of + its purpose remains meaningful. + + (For example, a function in a library to compute square roots has + a purpose that is entirely well-defined independent of the + application. Therefore, Subsection 2d requires that any + application-supplied function or table used by this function must + be optional: if the application does not supply it, the square + root function must still compute square roots.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Library, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. 
But when you +distribute the same sections as part of a whole which is a work based +on the Library, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote +it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Library. + +In addition, mere aggregation of another work not based on the Library +with the Library (or with a work based on the Library) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may opt to apply the terms of the ordinary GNU General Public +License instead of this License to a given copy of the Library. To do +this, you must alter all the notices that refer to this License, so +that they refer to the ordinary GNU General Public License, version 2, +instead of to this License. (If a newer version than version 2 of the +ordinary GNU General Public License has appeared, then you can specify +that version instead if you wish.) Do not make any other change in +these notices. + + Once this change is made in a given copy, it is irreversible for +that copy, so the ordinary GNU General Public License applies to all +subsequent copies and derivative works made from that copy. + + This option is useful when you wish to copy part of the code of +the Library into a program that is not a library. + + 4. You may copy and distribute the Library (or a portion or +derivative of it, under Section 2) in object code or executable form +under the terms of Sections 1 and 2 above provided that you accompany +it with the complete corresponding machine-readable source code, which +must be distributed under the terms of Sections 1 and 2 above on a +medium customarily used for software interchange. + + If distribution of object code is made by offering access to copy +from a designated place, then offering equivalent access to copy the +source code from the same place satisfies the requirement to +distribute the source code, even though third parties are not +compelled to copy the source along with the object code. + + 5. A program that contains no derivative of any portion of the +Library, but is designed to work with the Library by being compiled or +linked with it, is called a "work that uses the Library". Such a +work, in isolation, is not a derivative work of the Library, and +therefore falls outside the scope of this License. + + However, linking a "work that uses the Library" with the Library +creates an executable that is a derivative of the Library (because it +contains portions of the Library), rather than a "work that uses the +library". The executable is therefore covered by this License. +Section 6 states terms for distribution of such executables. + + When a "work that uses the Library" uses material from a header file +that is part of the Library, the object code for the work may be a +derivative work of the Library even though the source code is not. +Whether this is true is especially significant if the work can be +linked without the Library, or if the work is itself a library. The +threshold for this to be true is not precisely defined by law. 
+ + If such an object file uses only numerical parameters, data +structure layouts and accessors, and small macros and small inline +functions (ten lines or less in length), then the use of the object +file is unrestricted, regardless of whether it is legally a derivative +work. (Executables containing this object code plus portions of the +Library will still fall under Section 6.) + + Otherwise, if the work is a derivative of the Library, you may +distribute the object code for the work under the terms of Section 6. +Any executables containing that work also fall under Section 6, +whether or not they are linked directly with the Library itself. + + 6. As an exception to the Sections above, you may also combine or +link a "work that uses the Library" with the Library to produce a +work containing portions of the Library, and distribute that work +under terms of your choice, provided that the terms permit +modification of the work for the customer's own use and reverse +engineering for debugging such modifications. + + You must give prominent notice with each copy of the work that the +Library is used in it and that the Library and its use are covered by +this License. You must supply a copy of this License. If the work +during execution displays copyright notices, you must include the +copyright notice for the Library among them, as well as a reference +directing the user to the copy of this License. Also, you must do one +of these things: + + a) Accompany the work with the complete corresponding + machine-readable source code for the Library including whatever + changes were used in the work (which must be distributed under + Sections 1 and 2 above); and, if the work is an executable linked + with the Library, with the complete machine-readable "work that + uses the Library", as object code and/or source code, so that the + user can modify the Library and then relink to produce a modified + executable containing the modified Library. (It is understood + that the user who changes the contents of definitions files in the + Library will not necessarily be able to recompile the application + to use the modified definitions.) + + b) Use a suitable shared library mechanism for linking with the + Library. A suitable mechanism is one that (1) uses at run time a + copy of the library already present on the user's computer system, + rather than copying library functions into the executable, and (2) + will operate properly with a modified version of the library, if + the user installs one, as long as the modified version is + interface-compatible with the version that the work was made with. + + c) Accompany the work with a written offer, valid for at + least three years, to give the same user the materials + specified in Subsection 6a, above, for a charge no more + than the cost of performing this distribution. + + d) If distribution of the work is made by offering access to copy + from a designated place, offer equivalent access to copy the above + specified materials from the same place. + + e) Verify that the user has already received a copy of these + materials or that you have already sent this user a copy. + + For an executable, the required form of the "work that uses the +Library" must include any data and utility programs needed for +reproducing the executable from it. 
However, as a special exception, +the materials to be distributed need not include anything that is +normally distributed (in either source or binary form) with the major +components (compiler, kernel, and so on) of the operating system on +which the executable runs, unless that component itself accompanies +the executable. + + It may happen that this requirement contradicts the license +restrictions of other proprietary libraries that do not normally +accompany the operating system. Such a contradiction means you cannot +use both them and the Library together in an executable that you +distribute. + + 7. You may place library facilities that are a work based on the +Library side-by-side in a single library together with other library +facilities not covered by this License, and distribute such a combined +library, provided that the separate distribution of the work based on +the Library and of the other library facilities is otherwise +permitted, and provided that you do these two things: + + a) Accompany the combined library with a copy of the same work + based on the Library, uncombined with any other library + facilities. This must be distributed under the terms of the + Sections above. + + b) Give prominent notice with the combined library of the fact + that part of it is a work based on the Library, and explaining + where to find the accompanying uncombined form of the same work. + + 8. You may not copy, modify, sublicense, link with, or distribute +the Library except as expressly provided under this License. Any +attempt otherwise to copy, modify, sublicense, link with, or +distribute the Library is void, and will automatically terminate your +rights under this License. However, parties who have received copies, +or rights, from you under this License will not have their licenses +terminated so long as such parties remain in full compliance. + + 9. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Library or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Library (or any work based on the +Library), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Library or works based on it. + + 10. Each time you redistribute the Library (or any work based on the +Library), the recipient automatically receives a license from the +original licensor to copy, distribute, link with or modify the Library +subject to these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties with +this License. + + 11. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Library at all. 
For example, if a patent +license would not permit royalty-free redistribution of the Library by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Library. + +If any portion of this section is held invalid or unenforceable under any +particular circumstance, the balance of the section is intended to apply, +and the section as a whole is intended to apply in other circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 12. If the distribution and/or use of the Library is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Library under this License may add +an explicit geographical distribution limitation excluding those countries, +so that distribution is permitted only in or among countries not thus +excluded. In such case, this License incorporates the limitation as if +written in the body of this License. + + 13. The Free Software Foundation may publish revised and/or new +versions of the Lesser General Public License from time to time. +Such new versions will be similar in spirit to the present version, +but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library +specifies a version number of this License which applies to it and +"any later version", you have the option of following the terms and +conditions either of that version or of any later version published by +the Free Software Foundation. If the Library does not specify a +license version number, you may choose any version ever published by +the Free Software Foundation. + + 14. If you wish to incorporate parts of the Library into other free +programs whose distribution conditions are incompatible with these, +write to the author to ask for permission. For software which is +copyrighted by the Free Software Foundation, write to the Free +Software Foundation; we sometimes make exceptions for this. Our +decision will be guided by the two goals of preserving the free status +of all derivatives of our free software and of promoting the sharing +and reuse of software generally. + + NO WARRANTY + + 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO +WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. +EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR +OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY +KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE +LIBRARY IS WITH YOU. 
SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME +THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN +WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY +AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU +FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR +CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE +LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING +RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A +FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF +SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES. + + END OF TERMS AND CONDITIONS +
diff --git a/Makefile.gbase b/Makefile.gbase new file mode 100644 index 0000000..ad03d36 --- /dev/null +++ b/Makefile.gbase
@@ -0,0 +1,248 @@ +# +# Copyright (C) 2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# + +# Makefile for libacml_mv library + +# What we're building, and where to find it. +LIBRARY = libacml_mv.a + +TARGETS = $(LIBRARY) + +# Makefile setup +include $(COMMONDEFS) + +VPATH = $(BUILD_BASE)/src:$(BUILD_BASE)/src/gas + +# Compiler options +LCOPTS = $(STD_COMPILE_OPTS) $(STD_C_OPTS) +LCDEFS = $(HOSTDEFS) $(TARGDEFS) +LCINCS = -I$(BUILD_BASE)/inc + +# CFLAGS += -Wall -W -Wstrict-prototypes -Werror -fPIC -O2 $(DEBUG) + +ifeq ($(BUILD_ARCH), X8664) + +CFILES = \ + acos.c \ + acosf.c \ + acosh.c \ + acoshf.c \ + asin.c \ + asinf.c \ + asinh.c \ + asinhf.c \ + atan2.c \ + atan2f.c \ + atan.c \ + atanf.c \ + atanh.c \ + atanhf.c \ + ceil.c \ + ceilf.c \ + cosh.c \ + coshf.c \ + exp_special.c \ + finite.c \ + finitef.c \ + floor.c \ + floorf.c \ + frexp.c \ + frexpf.c \ + hypot.c \ + hypotf.c \ + ilogb.c \ + ilogbf.c \ + ldexp.c \ + ldexpf.c \ + libm_special.c \ + llrint.c \ + llrintf.c \ + llround.c \ + llroundf.c \ + log1p.c \ + log1pf.c \ + logb.c \ + logbf.c \ + log_special.c \ + lrint.c \ + lrintf.c \ + lround.c \ + lroundf.c \ + modf.c \ + modff.c \ + nan.c \ + nanf.c \ + nearbyintf.c \ + nextafter.c \ + nextafterf.c \ + nexttoward.c \ + nexttowardf.c \ + pow_special.c \ + remainder_piby2.c \ + remainder_piby2d2f.c \ + rint.c \ + rintf.c \ + roundf.c \ + scalbln.c \ + scalblnf.c \ + scalbn.c \ + scalbnf.c \ + sincos_special.c \ + sinh.c \ + sinhf.c \ + sqrt.c \ + sqrtf.c \ + tan.c \ + tanf.c \ + tanh.c \ + tanhf.c + +ASFILES = \ + cbrtf.S \ + cbrt.S \ + copysignf.S \ + copysign.S \ + cosf.S \ + cos.S \ + exp10f.S \ + exp10.S \ + exp2f.S \ + exp2.S \ + expf.S \ + expm1f.S \ + expm1.S \ + exp.S \ + fabsf.S \ + fabs.S \ + fdimf.S \ + fdim.S \ + fmaxf.S \ + fmax.S \ + fminf.S \ + fmin.S \ + fmodf.S \ + fmod.S \ + log10f.S \ + log10.S \ + log2f.S \ + log2.S \ + logf.S \ + log.S \ + nearbyint.S \ + powf.S \ + pow.S \ + remainderf.S \ + remainder.S \ + round.S \ + sincosf.S \ + sincos.S \ + sinf.S \ + sin.S \ + truncf.S \ + trunc.S \ + v4hcosl.S \ + v4helpl.S \ + v4hfrcpal.S \ + v4hlog10l.S \ + v4hlog2l.S \ + v4hlogl.S \ + v4hsinl.S \ + vrd2cos.S \ + vrd2exp.S \ + vrd2log10.S \ + vrd2log2.S \ + vrd2log.S \ + vrd2sincos.S \ + vrd2sin.S \ + vrd4cos.S \ + vrd4exp.S \ + vrd4frcpa.S \ + vrd4log10.S \ + vrd4log2.S \ + vrd4log.S \ + vrd4sin.S \ + vrdacos.S \ + vrdaexp.S \ + vrdalog10.S \ + vrdalog2.S \ + vrdalogr.S \ + vrdalog.S \ + vrda_scaled_logr.S \ + vrda_scaledshifted_logr.S \ + vrdasincos.S \ + vrdasin.S \ + vrs4cosf.S \ + vrs4expf.S \ + vrs4log10f.S \ + vrs4log2f.S \ + vrs4logf.S \ + vrs4powf.S \ + vrs4powxf.S \ + vrs4sincosf.S \ + vrs4sinf.S \ + vrs8expf.S \ + vrs8log10f.S \ + vrs8log2f.S \ + vrs8logf.S \ + vrsacosf.S \ + vrsaexpf.S \ + vrsalog10f.S \ + vrsalog2f.S \ + vrsalogf.S \ + vrsapowf.S \ + 
vrsapowxf.S \ + vrsasincosf.S \ + vrsasinf.S + +else + +# The special processing of the -lm option in the compiler driver should +# be delayed until all of the options have been parsed. Until the +# driver is cleaned up, it is important that processing be the same on +# all architectures. Thus we add an empty 32-bit ACML vector math +# library. + +dummy.c : + echo "void libacml_mv_placeholder() {}" > dummy.c + +CFILES = dummy.c +LDIRT += dummy.c + +endif + + +default: + $(MAKE) first + $(MAKE) $(TARGETS) + $(MAKE) last + +first : +ifndef SKIP_DEP_BUILD + $(call submake,$(BUILD_AREA)/include) +endif + +last : make_libdeps + +include $(COMMONRULES) + +$(LIBRARY): $(OBJECTS) + $(ar) cru $@ $^ + $(ranlib) $@ +
diff --git a/acml_trace.cc b/acml_trace.cc new file mode 100644 index 0000000..b5c967f --- /dev/null +++ b/acml_trace.cc
@@ -0,0 +1,91 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <float.h>
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include <algorithm>
+#include <functional>
+#include <memory>
+#include <string>
+#include <utility>
+#include <vector>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/helpers.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "third_party/absl/strings/cord.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+#include "util/task/status.h"
+
+template<typename T>
+std::unique_ptr<std::vector<T>> InitTrace(
+    const char* filename,
+    std::function<T(CordReader* reader)> callback) {
+  std::unique_ptr<std::vector<T>> trace(new std::vector<T>);
+  Cord cord;
+  CHECK_OK(file::GetContents(filename, &cord, file::Defaults()));
+  CordReader reader(cord);
+
+  while (!reader.done()) {
+    trace->push_back(callback(&reader));
+  }
+
+  return trace;
+}
+
+// Read a trace file with doubles.
+std::unique_ptr<std::vector<double>> GetTraceDouble(const char *filename) {
+  std::function<double(CordReader* reader)> read_double =
+      [](CordReader* reader) {
+        double d;
+        CHECK_GE(reader->Available(), sizeof(d));
+        reader->ReadN(sizeof(d), reinterpret_cast<char*>(&d));
+        return d;
+      };
+  std::unique_ptr<std::vector<double>> trace(InitTrace<double>(filename,
+                                                               read_double));
+  return trace;
+}
+
+// Read a trace file with pairs of doubles.
+std::unique_ptr<std::vector<std::pair<double, double>>> GetTraceDoublePair(
+    const char *filename) {
+  std::function<std::pair<double, double>(CordReader* reader)> read_double =
+      [](CordReader* reader) {
+        double d[2];
+        CHECK_GE(reader->Available(), sizeof(d));
+        reader->ReadN(sizeof(d), reinterpret_cast<char*>(&d));
+        return std::make_pair(d[0], d[1]);
+      };
+  std::unique_ptr<std::vector<std::pair<double, double>>> trace(
+      InitTrace<std::pair<double, double>>(filename, read_double));
+  return trace;
+}
+
+// Read a trace file with floats.
+std::unique_ptr<std::vector<float>> GetTraceFloat(const char *filename) {
+  std::function<float(CordReader* reader)> read_float =
+      [](CordReader* reader) {
+        float f;
+        // Guard against a short final record; read no more than is available.
+        const size_t bytes_to_read = std::min(sizeof(f), reader->Available());
+        reader->ReadN(bytes_to_read, reinterpret_cast<char*>(&f));
+        return f;
+      };
+  std::unique_ptr<std::vector<float>> trace(InitTrace<float>(filename,
+                                                             read_float));
+  return trace;
+}
diff --git a/acml_trace.h b/acml_trace.h new file mode 100644 index 0000000..65eda94 --- /dev/null +++ b/acml_trace.h
@@ -0,0 +1,27 @@
+// Copyright 2012 and onwards Google Inc.
+// Author: martint@google.com (Martin Thuresson)
+
+#ifndef THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
+#define THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
+
+// Trace files gathered from a complete run of rephil/docs (the expf
+// trace comes from a fastmath unittest). Each file contains the
+// arguments of all exp/log/pow calls.
+#define BASE_TRACE_PATH "google3/third_party/open64_libacml_mv/testdata/"
+#define EXP_LOGFILE (BASE_TRACE_PATH "exp.rephil_docs.builtin.baseline.trace")
+#define EXPF_LOGFILE (BASE_TRACE_PATH "expf.fastmath_unittest.trace")
+#define LOG_LOGFILE (BASE_TRACE_PATH "log.rephil_docs.builtin.baseline.trace")
+#define POW_LOGFILE (BASE_TRACE_PATH "pow.rephil_docs.builtin.baseline.trace")
+
+#include <memory>
+#include <utility>
+#include <vector>
+
+std::unique_ptr<std::vector<std::pair<double, double>>> GetTraceDoublePair(
+    const char *filename);
+
+std::unique_ptr<std::vector<double>> GetTraceDouble(const char *filename);
+
+std::unique_ptr<std::vector<float>> GetTraceFloat(const char *filename);
+
+#endif  // THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
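Note: a minimal sketch of how these readers are meant to be used (as in the benchmark below, the path macros are joined with FLAGS_test_srcdir; includes abbreviated, error handling elided):

    #include <memory>
    #include <vector>

    #include "file/base/path.h"
    #include "testing/base/public/googletest.h"
    #include "third_party/open64_libacml_mv/acml_trace.h"

    void WalkExpTrace() {
      // Load every argument recorded for exp() and fold over it.
      std::unique_ptr<std::vector<double>> trace = GetTraceDouble(
          file::JoinPath(FLAGS_test_srcdir, EXP_LOGFILE).c_str());
      double sum = 0.0;
      for (double x : *trace) sum += x;
    }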
diff --git a/acml_trace_benchmark.cc b/acml_trace_benchmark.cc new file mode 100644 index 0000000..fb6acc4 --- /dev/null +++ b/acml_trace_benchmark.cc
@@ -0,0 +1,272 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <float.h>
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include <memory>
+#include <vector>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/path.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+
+
+int main(int argc, char** argv) {
+  InitGoogle(argv[0], &argc, &argv, true);
+  RunSpecifiedBenchmarks();
+  return 0;
+}
+
+namespace {
+
+// Local typedefs to avoid repeating complex types all over the file.
+typedef std::unique_ptr<std::vector<double>> DoubleListPtr;
+typedef std::unique_ptr<std::vector<float>> FloatListPtr;
+typedef std::unique_ptr<std::vector<std::pair<double,
+                                              double>>> DoublePairListPtr;
+
+/////////////////////////
+// Benchmark log() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_log(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+                                                    LOG_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  // Process trace.
+  double d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += *itr;
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_log().
+static void BM_math_trace_acmllog(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+                                                    LOG_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  double d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += acml_log(*itr);
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+// Benchmark log().
+static void BM_math_trace_log(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+                                                    LOG_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  double d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += log(*itr);
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+
+/////////////////////////
+// Benchmark exp() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_exp(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+                                                    EXP_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  double d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += *itr;
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_exp().
+static void BM_math_trace_acmlexp(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+                                                    EXP_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  double d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += acml_exp(*itr);
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+// Benchmark exp().
+static void BM_math_trace_exp(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+                                                    EXP_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  double d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += exp(*itr);
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+/////////////////////////
+// Benchmark expf() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_expf(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+                                                  EXPF_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  float d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += *itr;
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_expf().
+static void BM_math_trace_acmlexpf(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+                                                  EXPF_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  float d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += acml_expf(*itr);
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+// Benchmark expf().
+static void BM_math_trace_expf(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+                                                  EXPF_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  float d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += expf(*itr);
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+
+/////////////////////////
+// Benchmark pow() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_pow(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath(
+      FLAGS_test_srcdir, POW_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  double d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += (*itr).first + (*itr).second;
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_pow().
+static void BM_math_trace_acmlpow(int iters) {
+  // Read trace file into memory.
+  StopBenchmarkTiming();
+  DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath(
+      FLAGS_test_srcdir, POW_LOGFILE).c_str()));
+  StartBenchmarkTiming();
+  double d = 0.0;
+  for (int iter = 0; iter < iters; ++iter) {
+    for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+      d += acml_pow((*itr).first,
+                    (*itr).second);
+    }
+  }
+  CHECK_NE(d, 0.0);
+}
+
+// Benchmark pow().
+static void BM_math_trace_pow(int iters) {
+  // Read trace file into memory.
+ StopBenchmarkTiming(); + DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath( + FLAGS_test_srcdir, POW_LOGFILE).c_str())); + StartBenchmarkTiming(); + double d = 0.0; + for (int iter = 0; iter < iters; ++iter) { + for (auto itr = trace->begin(); itr != trace->end(); ++itr) { + d += pow((*itr).first, + (*itr).second); + } + } + CHECK_NE(d, 0.0); +} + + +BENCHMARK(BM_math_trace_read_exp); +BENCHMARK(BM_math_trace_acmlexp); +BENCHMARK(BM_math_trace_exp); + +BENCHMARK(BM_math_trace_read_log); +BENCHMARK(BM_math_trace_acmllog); +BENCHMARK(BM_math_trace_log); + +BENCHMARK(BM_math_trace_read_pow); +BENCHMARK(BM_math_trace_acmlpow); +BENCHMARK(BM_math_trace_pow); + +BENCHMARK(BM_math_trace_read_expf); +BENCHMARK(BM_math_trace_acmlexpf); +BENCHMARK(BM_math_trace_expf); + +} // namespace
diff --git a/acml_trace_validate_test.cc b/acml_trace_validate_test.cc new file mode 100644 index 0000000..9bd682c --- /dev/null +++ b/acml_trace_validate_test.cc
@@ -0,0 +1,122 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <math.h>
+#include <stdio.h>
+
+#include <cstdlib>
+#include <cstring>
+#include <memory>
+#include <vector>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/path.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "testing/base/public/gunit.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+
+
+int main(int argc, char** argv) {
+  InitGoogle(argv[0], &argc, &argv, true);
+  RunSpecifiedBenchmarks();
+  return RUN_ALL_TESTS();
+}
+
+
+// Compare two doubles given a maximum unit of least precision (ULP).
+bool AlmostEqualDoubleUlps(double A, double B, int64 maxUlps) {
+  CHECK_EQ(sizeof(A), sizeof(maxUlps));
+  if (A == B)
+    return true;
+  // memcpy the bit patterns; reinterpret_cast would violate strict aliasing.
+  int64 aBits, bBits;
+  memcpy(&aBits, &A, sizeof(A));
+  memcpy(&bBits, &B, sizeof(B));
+  int64 intDiff = std::abs(aBits - bBits);
+  return intDiff <= maxUlps;
+}
+
+// Compare two floats given a maximum unit of least precision (ULP).
+bool AlmostEqualFloatUlps(float A, float B, int32 maxUlps) {
+  CHECK_EQ(sizeof(A), sizeof(maxUlps));
+  if (A == B)
+    return true;
+  // memcpy the bit patterns; reinterpret_cast would violate strict aliasing.
+  int32 aBits, bBits;
+  memcpy(&aBits, &A, sizeof(A));
+  memcpy(&bBits, &B, sizeof(B));
+  int32 intDiff = std::abs(aBits - bBits);
+  return intDiff <= maxUlps;
+}
+
+TEST(Case, LogTest) {
+  // Read trace file into memory.
+  std::unique_ptr<std::vector<double>> trace(
+      GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+                                    LOG_LOGFILE).c_str()));
+  double d1;
+  double d2;
+  for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+    d1 = acml_log(*iter);
+    d2 = log(*iter);
+    // Make sure difference is at most 1 ULP.
+    EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+  }
+}
+
+TEST(Case, ExpTest) {
+  // Read trace file into memory.
+  std::unique_ptr<std::vector<double>> trace(
+      GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+                                    EXP_LOGFILE).c_str()));
+  double d1;
+  double d2;
+  for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+    d1 = acml_exp(*iter);
+    d2 = exp(*iter);
+    // Make sure difference is at most 1 ULP.
+    EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+  }
+}
+
+
+TEST(Case, ExpfTest) {
+  // Read trace file into memory.
+  std::unique_ptr<std::vector<float>> trace(
+      GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+                                   EXPF_LOGFILE).c_str()));
+  float f1;
+  float f2;
+  for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+    f1 = acml_expf(*iter);
+    f2 = expf(*iter);
+    // Make sure difference is at most 1 ULP.
+    EXPECT_TRUE(AlmostEqualFloatUlps(f1, f2, 1));
+  }
+}
+
+
+TEST(Case, PowTest) {
+  // Read trace file into memory.
+  std::unique_ptr<std::vector<std::pair<double, double>>> trace(
+      GetTraceDoublePair(file::JoinPath(FLAGS_test_srcdir,
+                                        POW_LOGFILE).c_str()));
+  double d1;
+  double d2;
+  for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+    d1 = acml_pow((*iter).first,
+                  (*iter).second);
+    d2 = pow((*iter).first,
+             (*iter).second);
+    // Make sure difference is at most 1 ULP.
+    EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+  }
+}
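Note: the ULP comparison above works because IEEE-754 values of the same sign, viewed as integers, are ordered, and adjacent representable doubles differ by exactly one. A standalone sketch (helper name hypothetical, not part of this change) demonstrating that property with std::nextafter:

    #include <cmath>
    #include <cstdint>
    #include <cstring>

    #include "base/logging.h"

    void CheckAdjacentUlp() {
      double a = 1.0;
      double b = std::nextafter(a, 2.0);  // Smallest double greater than 1.0.
      int64_t ia, ib;
      memcpy(&ia, &a, sizeof(a));
      memcpy(&ib, &b, sizeof(b));
      CHECK_EQ(ib - ia, 1);  // Bit-adjacent values are exactly 1 ULP apart.
    }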
diff --git a/inc/acml_mv.h b/inc/acml_mv.h new file mode 100644 index 0000000..49b7feb --- /dev/null +++ b/inc/acml_mv.h
@@ -0,0 +1,85 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv.  If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+/*
+** A header file defining the C prototypes for the fast/vector libm functions
+*/
+
+#ifndef ACML_MV_H_INCLUDED
+#define ACML_MV_H_INCLUDED 1
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+/*
+** The scalar routines.
+*/
+double fastexp(double);
+double fastlog(double);
+double fastlog10(double);
+double fastlog2(double);
+double fastpow(double, double);
+double fastsin(double);
+double fastcos(double);
+void fastsincos(double, double *, double *);
+
+float fastexpf(float);
+float fastlogf(float);
+float fastlog10f(float);
+float fastlog2f(float);
+float fastpowf(float, float);
+float fastcosf(float);
+float fastsinf(float);
+void fastsincosf(float, float *, float *);
+
+
+/*
+** The array routines.
+*/
+void vrda_exp(int, double *, double *);
+void vrda_log(int, double *, double *);
+void vrda_log10(int, double *, double *);
+void vrda_log2(int, double *, double *);
+void vrda_sin(int, double *, double *);
+void vrda_cos(int, double *, double *);
+void vrda_sincos(int, double *, double *, double *);
+
+void vrsa_expf(int, float *, float *);
+void vrsa_logf(int, float *, float *);
+void vrsa_log10f(int, float *, float *);
+void vrsa_log2f(int, float *, float *);
+void vrsa_powf(int n, float *x, float *y, float *z);
+void vrsa_powxf(int n, float *x, float y, float *z);
+void vrsa_sinf(int, float *, float *);
+void vrsa_cosf(int, float *, float *);
+void vrsa_sincosf(int, float *, float *, float *);
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* ACML_MV_H_INCLUDED */
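Note: for orientation, a short usage sketch for the array ("vrda"/"vrsa") entry points declared above, assuming the conventional ACML calling convention of (count, input array, output array):

    #include "acml_mv.h"

    void ArrayExample(void) {
      double in[4] = {1.0, 2.0, 3.0, 4.0};
      double out[4];
      vrda_exp(4, in, out);      /* out[i] = exp(in[i]) for i in [0, 4) */

      double s[4], c[4];
      vrda_sincos(4, in, s, c);  /* s[i] = sin(in[i]), c[i] = cos(in[i]) */
    }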
diff --git a/inc/acml_mv_m128.h b/inc/acml_mv_m128.h new file mode 100644 index 0000000..c783fe3 --- /dev/null +++ b/inc/acml_mv_m128.h
@@ -0,0 +1,109 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv.  If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+/*
+** A header file defining the C prototypes for the fast/vector libm functions
+*/
+
+#ifndef ACML_MV_M128_H_INCLUDED
+#define ACML_MV_M128_H_INCLUDED 1
+
+#include <emmintrin.h>  /* __m128 and __m128d */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+/*
+** The scalar routines.
+*/
+double fastexp(double);
+double fastlog(double);
+double fastlog10(double);
+double fastlog2(double);
+double fastpow(double, double);
+double fastsin(double);
+double fastcos(double);
+void fastsincos(double, double *, double *);
+
+float fastexpf(float);
+float fastlogf(float);
+float fastlog10f(float);
+float fastlog2f(float);
+float fastpowf(float, float);
+float fastcosf(float);
+float fastsinf(float);
+void fastsincosf(float, float *, float *);
+
+/*
+** The single vector routines.
+*/
+__m128d __vrd2_log(__m128d);
+__m128d __vrd2_exp(__m128d);
+__m128d __vrd2_log10(__m128d);
+__m128d __vrd2_log2(__m128d);
+__m128d __vrd2_sin(__m128d);
+__m128d __vrd2_cos(__m128d);
+void __vrd2_sincos(__m128d, __m128d *, __m128d *);
+
+__m128 __vrs4_expf(__m128);
+__m128 __vrs4_logf(__m128);
+__m128 __vrs4_log10f(__m128);
+__m128 __vrs4_log2f(__m128);
+__m128 __vrs4_powf(__m128, __m128);
+__m128 __vrs4_powxf(__m128 x, float y);
+__m128 __vrs4_sinf(__m128);
+__m128 __vrs4_cosf(__m128);
+void __vrs4_sincosf(__m128, __m128 *, __m128 *);
+
+
+/*
+** The array routines.
+*/
+void vrda_exp(int, double *, double *);
+void vrda_log(int, double *, double *);
+void vrda_log10(int, double *, double *);
+void vrda_log2(int, double *, double *);
+void vrda_sin(int, double *, double *);
+void vrda_cos(int, double *, double *);
+void vrda_sincos(int, double *, double *, double *);
+
+void vrsa_expf(int, float *, float *);
+void vrsa_logf(int, float *, float *);
+void vrsa_log10f(int, float *, float *);
+void vrsa_log2f(int, float *, float *);
+void vrsa_powf(int n, float *x, float *y, float *z);
+void vrsa_powxf(int n, float *x, float y, float *z);
+void vrsa_sinf(int, float *, float *);
+void vrsa_cosf(int, float *, float *);
+void vrsa_sincosf(int, float *, float *, float *);
+
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* ACML_MV_M128_H_INCLUDED */
diff --git a/inc/fn_macros.h b/inc/fn_macros.h new file mode 100644 index 0000000..afc2f59 --- /dev/null +++ b/inc/fn_macros.h
@@ -0,0 +1,47 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv.  If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef __FN_MACROS_H__
+#define __FN_MACROS_H__
+
+#if defined(WINDOWS)
+#pragma warning( disable : 4985 )
+#define FN_PROTOTYPE(fn_name) acml_impl_##fn_name
+#else
+/* On Linux, implementation names get an acml_impl_ prefix (upstream prepended a double underscore). */
+#define ACML_CONCAT(x,y) x##y
+/* #define FN_PROTOTYPE(fn_name) ACML_CONCAT(__,fn_name) */
+#define FN_PROTOTYPE(fn_name) ACML_CONCAT(acml_impl_,fn_name) /* commenting out previous line for build success, !!!!! REVISIT THIS SOON !!!!! */
+#endif
+
+
+#if defined(WINDOWS)
+#define weak_alias(name, aliasname) /* as nothing */
+#else
+/* Define ALIASNAME as a weak alias for NAME.
+   If weak aliases are not available, this defines a strong alias. */
+#define weak_alias(name, aliasname) /* _weak_alias (name, aliasname) */ /* !!!!! REVISIT THIS SOON !!!!! */
+#define _weak_alias(name, aliasname) extern __typeof (name) aliasname __attribute__ ((weak, alias (#name)));
+#endif
+
+#endif  // __FN_MACROS_H__
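Note: to make the renaming concrete, here is what the Linux branch produces after preprocessing (declaration only; the weak-alias expansion is shown for the currently disabled _weak_alias path):

    #include "fn_macros.h"

    // FN_PROTOTYPE(exp) expands, via ACML_CONCAT, to acml_impl_exp:
    double FN_PROTOTYPE(exp)(double x);  // i.e. double acml_impl_exp(double x);

    // If _weak_alias were enabled, _weak_alias(acml_impl_exp, exp) would expand to:
    //   extern __typeof (acml_impl_exp) exp __attribute__ ((weak, alias ("acml_impl_exp")));
    // publishing the standard name as a weak alias of the implementation symbol.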
diff --git a/inc/libm_amd.h b/inc/libm_amd.h new file mode 100644 index 0000000..66cd46c --- /dev/null +++ b/inc/libm_amd.h
@@ -0,0 +1,193 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv.  If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_AMD_H_INCLUDED
+#define LIBM_AMD_H_INCLUDED 1
+
+#include <emmintrin.h>
+#include "acml_mv.h"
+#include "acml_mv_m128.h"
+
+#include "fn_macros.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+double FN_PROTOTYPE(cbrt)(double x);
+float FN_PROTOTYPE(cbrtf)(float x);
+
+double FN_PROTOTYPE(fabs)(double x);
+float FN_PROTOTYPE(fabsf)(float x);
+
+double FN_PROTOTYPE(acos)(double x);
+float FN_PROTOTYPE(acosf)(float x);
+
+double FN_PROTOTYPE(acosh)(double x);
+float FN_PROTOTYPE(acoshf)(float x);
+
+double FN_PROTOTYPE(asin)(double x);
+float FN_PROTOTYPE(asinf)(float x);
+
+double FN_PROTOTYPE(asinh)(double x);
+float FN_PROTOTYPE(asinhf)(float x);
+
+double FN_PROTOTYPE(atan)(double x);
+float FN_PROTOTYPE(atanf)(float x);
+
+double FN_PROTOTYPE(atanh)(double x);
+float FN_PROTOTYPE(atanhf)(float x);
+
+double FN_PROTOTYPE(atan2)(double x, double y);
+float FN_PROTOTYPE(atan2f)(float x, float y);
+
+double FN_PROTOTYPE(ceil)(double x);
+float FN_PROTOTYPE(ceilf)(float x);
+
+double FN_PROTOTYPE(cos)(double x);
+float FN_PROTOTYPE(cosf)(float x);
+
+double FN_PROTOTYPE(cosh)(double x);
+float FN_PROTOTYPE(coshf)(float x);
+
+double FN_PROTOTYPE(exp)(double x);
+float FN_PROTOTYPE(expf)(float x);
+
+double FN_PROTOTYPE(expm1)(double x);
+float FN_PROTOTYPE(expm1f)(float x);
+
+double FN_PROTOTYPE(exp2)(double x);
+float FN_PROTOTYPE(exp2f)(float x);
+
+double FN_PROTOTYPE(exp10)(double x);
+float FN_PROTOTYPE(exp10f)(float x);
+
+double FN_PROTOTYPE(fdim)(double x, double y);
+float FN_PROTOTYPE(fdimf)(float x, float y);
+
+int FN_PROTOTYPE(finite)(double x);
+int FN_PROTOTYPE(finitef)(float x);
+
+double FN_PROTOTYPE(floor)(double x);
+float FN_PROTOTYPE(floorf)(float x);
+
+double FN_PROTOTYPE(fmax)(double x, double y);
+float FN_PROTOTYPE(fmaxf)(float x, float y);
+
+double FN_PROTOTYPE(fmin)(double x, double y);
+float FN_PROTOTYPE(fminf)(float x, float y);
+
+double FN_PROTOTYPE(fmod)(double x, double y);
+float FN_PROTOTYPE(fmodf)(float x, float y);
+
+double FN_PROTOTYPE(hypot)(double x, double y);
+float FN_PROTOTYPE(hypotf)(float x, float y);
+
+double FN_PROTOTYPE(ldexp)(double x, int exp);
+float FN_PROTOTYPE(ldexpf)(float x, int exp);
+
+double FN_PROTOTYPE(log)(double x);
+float FN_PROTOTYPE(logf)(float x);
+
+double FN_PROTOTYPE(log2)(double x);
+float FN_PROTOTYPE(log2f)(float x);
+
+double FN_PROTOTYPE(log10)(double x);
+float FN_PROTOTYPE(log10f)(float x);
+
+double FN_PROTOTYPE(log1p)(double x);
+float FN_PROTOTYPE(log1pf)(float x);
+
+double FN_PROTOTYPE(logb)(double x);
+float FN_PROTOTYPE(logbf)(float x);
+
+double FN_PROTOTYPE(modf)(double x, double *iptr);
+float FN_PROTOTYPE(modff)(float x, float *iptr);
+
+double FN_PROTOTYPE(nextafter)(double x, double y);
+float FN_PROTOTYPE(nextafterf)(float x, float y);
+
+double FN_PROTOTYPE(pow)(double x, double y);
+float FN_PROTOTYPE(powf)(float x, float y);
+
+double FN_PROTOTYPE(remainder)(double x, double y);
+float FN_PROTOTYPE(remainderf)(float x, float y);
+
+double FN_PROTOTYPE(sin)(double x);
+float FN_PROTOTYPE(sinf)(float x);
+
+void FN_PROTOTYPE(sincos)(double x, double *s, double *c);
+void FN_PROTOTYPE(sincosf)(float x, float *s, float *c);
+
+double FN_PROTOTYPE(sinh)(double x);
+float FN_PROTOTYPE(sinhf)(float x);
+
+double FN_PROTOTYPE(sqrt)(double x);
+float FN_PROTOTYPE(sqrtf)(float x);
+
+double FN_PROTOTYPE(tan)(double x);
+float FN_PROTOTYPE(tanf)(float x);
+
+double FN_PROTOTYPE(tanh)(double x);
+float FN_PROTOTYPE(tanhf)(float x);
+
+double FN_PROTOTYPE(trunc)(double x);
+float FN_PROTOTYPE(truncf)(float x);
+
+double FN_PROTOTYPE(frexp)(double value, int *exp);
+float FN_PROTOTYPE(frexpf)(float value, int *exp);
+int FN_PROTOTYPE(ilogb)(double x);
+int FN_PROTOTYPE(ilogbf)(float x);
+
+long long int FN_PROTOTYPE(llrint)(double x);
+long long int FN_PROTOTYPE(llrintf)(float x);
+long int FN_PROTOTYPE(lrint)(double x);
+long int FN_PROTOTYPE(lrintf)(float x);
+long int FN_PROTOTYPE(lround)(double d);
+long int FN_PROTOTYPE(lroundf)(float f);
+long long int FN_PROTOTYPE(llround)(double d);
+long long int FN_PROTOTYPE(llroundf)(float f);
+
+double FN_PROTOTYPE(nan)(const char *tagp);
+float FN_PROTOTYPE(nanf)(const char *tagp);
+double FN_PROTOTYPE(nearbyint)(double x);
+float FN_PROTOTYPE(nearbyintf)(float x);
+double FN_PROTOTYPE(nexttoward)(double x, long double y);
+float FN_PROTOTYPE(nexttowardf)(float x, long double y);
+double FN_PROTOTYPE(rint)(double x);
+float FN_PROTOTYPE(rintf)(float x);
+double FN_PROTOTYPE(round)(double f);
+float FN_PROTOTYPE(roundf)(float f);
+double FN_PROTOTYPE(scalbln)(double x, long int n);
+float FN_PROTOTYPE(scalblnf)(float x, long int n);
+double FN_PROTOTYPE(scalbn)(double x, int n);
+float FN_PROTOTYPE(scalbnf)(float x, int n);
+
+double FN_PROTOTYPE(copysign)(double x, double y);
+float FN_PROTOTYPE(copysignf)(float x, float y);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* LIBM_AMD_H_INCLUDED */
diff --git a/inc/libm_errno_amd.h b/inc/libm_errno_amd.h new file mode 100644 index 0000000..1e6b8b9 --- /dev/null +++ b/inc/libm_errno_amd.h
@@ -0,0 +1,33 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifndef LIBM_ERRNO_AMD_H_INCLUDED +#define LIBM_ERRNO_AMD_H_INCLUDED 1 + +#include <stdio.h> +#include <errno.h> +#ifndef __set_errno +#define __set_errno(x) errno = (x) +#endif + +#endif /* LIBM_ERRNO_AMD_H_INCLUDED */
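+
+/* Usage sketch: a routine reports a domain error through the macro above.
+   The wrapper below is hypothetical and only illustrates the pattern:
+
+       #include <math.h>
+       #include "libm_errno_amd.h"
+
+       double checked_log(double x)   // hypothetical, not part of libacml_mv
+       {
+         if (x < 0.0)
+           {
+             __set_errno(EDOM);       // expands to: errno = (EDOM)
+             return 0.0 / 0.0;        // quiet NaN, raises invalid
+           }
+         return log(x);
+       }
+*/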
diff --git a/inc/libm_inlines_amd.h b/inc/libm_inlines_amd.h new file mode 100644 index 0000000..a2e387a --- /dev/null +++ b/inc/libm_inlines_amd.h
@@ -0,0 +1,2188 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv.  If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_INLINES_AMD_H_INCLUDED
+#define LIBM_INLINES_AMD_H_INCLUDED 1
+
+#include "libm_util_amd.h"
+#include <math.h>
+
+#ifdef WINDOWS
+#define inline __inline
+#include <emmintrin.h>
+#endif
+
+/* Compile-time verification that type long long is the same size
+   as type double, so that the GET_BITS/PUT_BITS macros from
+   libm_util_amd.h can reinterpret one as the other */
+void check_long_against_double_size(int machine_is_64_bit[(sizeof(long long) == sizeof(double))?1:-1]);
+
+/* Set defines for inline functions calling other inlines */
+#if defined(USE_VAL_WITH_FLAGS) || defined(USE_VALF_WITH_FLAGS) || \
+    defined(USE_ZERO_WITH_FLAGS) || defined(USE_ZEROF_WITH_FLAGS) || \
+    defined(USE_NAN_WITH_FLAGS) || defined(USE_NANF_WITH_FLAGS) || \
+    defined(USE_INDEFINITE_WITH_FLAGS) || defined(USE_INDEFINITEF_WITH_FLAGS) || \
+    defined(USE_INFINITY_WITH_FLAGS) || defined(USE_INFINITYF_WITH_FLAGS) || \
+    defined(USE_SQRT_AMD_INLINE) || defined(USE_SQRTF_AMD_INLINE) || \
+    (defined(WINDOWS) && (defined(USE_HANDLE_ERROR) || defined(USE_HANDLE_ERRORF)))
+#undef USE_RAISE_FPSW_FLAGS
+#define USE_RAISE_FPSW_FLAGS 1
+#endif
+
+#if defined(USE_SPLITDOUBLE)
+/* Splits double x into exponent e and mantissa m, where 0.5 <= abs(m) < 1.0.
+   Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+   are not checked */
+static inline void splitDouble(double x, int *e, double *m)
+{
+  unsigned long long ux, uy;
+  GET_BITS_DP64(x, ux);
+  uy = ux;
+  ux &= EXPBITS_DP64;
+  ux >>= EXPSHIFTBITS_DP64;
+  *e = (int)ux - EXPBIAS_DP64 + 1;
+  uy = (uy & (SIGNBIT_DP64 | MANTBITS_DP64)) | HALFEXPBITS_DP64;
+  PUT_BITS_DP64(uy, x);
+  *m = x;
+}
+#endif /* USE_SPLITDOUBLE */
+
+
+#if defined(USE_SPLITDOUBLE_2)
+/* Splits double x into exponent e and mantissa m, where 1.0 <= abs(m) < 4.0.
+   Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+   are not checked. Also assumes EXPBIAS_DP64 is odd. With this
+   assumption, e will be even on exit. */
+static inline void splitDouble_2(double x, int *e, double *m)
+{
+  unsigned long long ux, vx;
+  GET_BITS_DP64(x, ux);
+  vx = ux;
+  ux &= EXPBITS_DP64;
+  ux >>= EXPSHIFTBITS_DP64;
+  if (ux & 1)
+  {
+    /* The exponent is odd */
+    vx = (vx & (SIGNBIT_DP64 | MANTBITS_DP64)) | ONEEXPBITS_DP64;
+    PUT_BITS_DP64(vx, x);
+    *m = x;
+    *e = ux - EXPBIAS_DP64;
+  }
+  else
+  {
+    /* The exponent is even */
+    vx = (vx & (SIGNBIT_DP64 | MANTBITS_DP64)) | TWOEXPBITS_DP64;
+    PUT_BITS_DP64(vx, x);
+    *m = x;
+    *e = ux - EXPBIAS_DP64 - 1;
+  }
+}
+#endif /* USE_SPLITDOUBLE_2 */
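+
+/* Worked example (illustrative): both helpers split along the binary
+   exponent.  For x = 6.0:
+
+       int e; double m;
+       splitDouble(6.0, &e, &m);    // e == 3, m == 0.75 (6.0 == 2^3 * 0.75)
+       splitDouble_2(6.0, &e, &m);  // e == 2, m == 1.5  (6.0 == 2^2 * 1.5)
+
+   splitDouble_2 trades the wider mantissa range [1,4) for a guaranteed
+   even exponent, which is what square root reconstruction needs. */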
+
+
+#if defined(USE_SPLITFLOAT)
+/* Splits float x into exponent e and mantissa m, where 0.5 <= abs(m) < 1.0.
+   Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+   are not checked */
+static inline void splitFloat(float x, int *e, float *m)
+{
+  unsigned int ux, uy;
+  GET_BITS_SP32(x, ux);
+  uy = ux;
+  ux &= EXPBITS_SP32;
+  ux >>= EXPSHIFTBITS_SP32;
+  *e = (int)ux - EXPBIAS_SP32 + 1;
+  uy = (uy & (SIGNBIT_SP32 | MANTBITS_SP32)) | HALFEXPBITS_SP32;
+  PUT_BITS_SP32(uy, x);
+  *m = x;
+}
+#endif /* USE_SPLITFLOAT */
+
+
+#if defined(USE_SCALEDOUBLE_1)
+/* Scales the double x by 2.0**n.
+   Assumes EMIN <= n <= EMAX, though this condition is not checked. */
+static inline double scaleDouble_1(double x, int n)
+{
+  double t;
+  /* Construct the number t = 2.0**n */
+  PUT_BITS_DP64(((long long)n + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t);
+  return x*t;
+}
+#endif /* USE_SCALEDOUBLE_1 */
+
+
+#if defined(USE_SCALEDOUBLE_2)
+/* Scales the double x by 2.0**n.
+   Assumes 2*EMIN <= n <= 2*EMAX, though this condition is not checked. */
+static inline double scaleDouble_2(double x, int n)
+{
+  double t1, t2;
+  int n1, n2;
+  n1 = n / 2;
+  n2 = n - n1;
+  /* Construct the numbers t1 = 2.0**n1 and t2 = 2.0**n2 */
+  PUT_BITS_DP64(((long long)n1 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t1);
+  PUT_BITS_DP64(((long long)n2 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t2);
+  return (x*t1)*t2;
+}
+#endif /* USE_SCALEDOUBLE_2 */
+
+
+#if defined(USE_SCALEDOUBLE_3)
+/* Scales the double x by 2.0**n.
+   Assumes 3*EMIN <= n <= 3*EMAX, though this condition is not checked. */
+static inline double scaleDouble_3(double x, int n)
+{
+  double t1, t2, t3;
+  int n1, n2, n3;
+  n1 = n / 3;
+  n2 = (n - n1) / 2;
+  n3 = n - n1 - n2;
+  /* Construct the numbers t1 = 2.0**n1, t2 = 2.0**n2 and t3 = 2.0**n3 */
+  PUT_BITS_DP64(((long long)n1 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t1);
+  PUT_BITS_DP64(((long long)n2 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t2);
+  PUT_BITS_DP64(((long long)n3 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t3);
+  return ((x*t1)*t2)*t3;
+}
+#endif /* USE_SCALEDOUBLE_3 */
+
+
+#if defined(USE_SCALEFLOAT_1)
+/* Scales the float x by 2.0**n.
+   Assumes EMIN <= n <= EMAX, though this condition is not checked. */
+static inline float scaleFloat_1(float x, int n)
+{
+  float t;
+  /* Construct the number t = 2.0**n */
+  PUT_BITS_SP32((n + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t);
+  return x*t;
+}
+#endif /* USE_SCALEFLOAT_1 */
+
+
+#if defined(USE_SCALEFLOAT_2)
+/* Scales the float x by 2.0**n.
+   Assumes 2*EMIN <= n <= 2*EMAX, though this condition is not checked. */
+static inline float scaleFloat_2(float x, int n)
+{
+  float t1, t2;
+  int n1, n2;
+  n1 = n / 2;
+  n2 = n - n1;
+  /* Construct the numbers t1 = 2.0**n1 and t2 = 2.0**n2 */
+  PUT_BITS_SP32((n1 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t1);
+  PUT_BITS_SP32((n2 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t2);
+  return (x*t1)*t2;
+}
+#endif /* USE_SCALEFLOAT_2 */
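+
+/* Usage sketch (illustrative): the multi-step variants exist because a
+   single factor 2.0**n is representable only while n stays within the
+   normal exponent range; splitting n keeps every intermediate factor
+   normal.  For a scaling far beyond EMAX:
+
+       double y = scaleDouble_2(x, 1500);   // computes (x * 2^750) * 2^750
+
+   2^1500 itself is not representable as a double, but each 2^750 factor
+   is, so the scaling succeeds whenever the final result is in range. */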
+
+
+#if defined(USE_SCALEFLOAT_3)
+/* Scales the float x by 2.0**n.
+   Assumes 3*EMIN <= n <= 3*EMAX, though this condition is not checked. */
+static inline float scaleFloat_3(float x, int n)
+{
+  float t1, t2, t3;
+  int n1, n2, n3;
+  n1 = n / 3;
+  n2 = (n - n1) / 2;
+  n3 = n - n1 - n2;
+  /* Construct the numbers t1 = 2.0**n1, t2 = 2.0**n2 and t3 = 2.0**n3 */
+  PUT_BITS_SP32((n1 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t1);
+  PUT_BITS_SP32((n2 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t2);
+  PUT_BITS_SP32((n3 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t3);
+  return ((x*t1)*t2)*t3;
+}
+#endif /* USE_SCALEFLOAT_3 */
+
+#if defined(USE_SETPRECISIONDOUBLE)
+unsigned int setPrecisionDouble(void)
+{
+  unsigned int cwold = 0;
+  /* There is no precision control on Hammer */
+  return cwold;
+}
+#endif /* USE_SETPRECISIONDOUBLE */
+
+#if defined(USE_RESTOREPRECISION)
+void restorePrecision(unsigned int cwold)
+{
+#if defined(WINDOWS)
+  /* There is no precision control on Hammer */
+#elif defined(linux)
+  /* There is no precision control on Hammer */
+#else
+#error Unknown machine
+#endif
+  return;
+}
+#endif /* USE_RESTOREPRECISION */
+
+
+#if defined(USE_CLEAR_FPSW_FLAGS)
+/* Clears floating-point status flags. The argument should be
+   the bitwise or of the AMD_F_* flags (defined in libm_util_amd.h)
+   to be cleared, e.g.
+   clear_fpsw_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline void clear_fpsw_flags(int flags)
+{
+#if defined(WINDOWS)
+  unsigned int cw = _mm_getcsr();
+  cw &= (~flags);
+  _mm_setcsr(cw);
+#elif defined(linux)
+  unsigned int cw;
+  /* Get the current floating-point control/status word */
+  asm volatile ("STMXCSR %0" : "=m" (cw));
+  cw &= (~flags);
+  asm volatile ("LDMXCSR %0" : : "m" (cw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_CLEAR_FPSW_FLAGS */
+
+
+#if defined(USE_RAISE_FPSW_FLAGS)
+/* Raises floating-point status flags. The argument should be
+   the bitwise or of the AMD_F_* flags (defined in libm_util_amd.h)
+   to be raised, e.g.
+ raise_fpsw_flags(AMD_F_INEXACT | AMD_F_INVALID); + */ +static inline void raise_fpsw_flags(int flags) +{ +#if defined(WINDOWS) + _mm_setcsr(_mm_getcsr() | flags); +#elif defined(linux) + unsigned int cw; + /* Get the current floating-point control/status word */ + asm volatile ("STMXCSR %0" : "=m" (cw)); + cw |= flags; + asm volatile ("LDMXCSR %0" : : "m" (cw)); +#else +#error Unknown machine +#endif +} +#endif /* USE_RAISE_FPSW_FLAGS */ + + +#if defined(USE_GET_FPSW_INLINE) +/* Return the current floating-point status word */ +static inline unsigned int get_fpsw_inline(void) +{ +#if defined(WINDOWS) + return _mm_getcsr(); +#elif defined(linux) + unsigned int sw; + asm volatile ("STMXCSR %0" : "=m" (sw)); + return sw; +#else +#error Unknown machine +#endif +} +#endif /* USE_GET_FPSW_INLINE */ + +#if defined(USE_SET_FPSW_INLINE) +/* Set the floating-point status word */ +static inline void set_fpsw_inline(unsigned int sw) +{ +#if defined(WINDOWS) + _mm_setcsr(sw); +#elif defined(linux) + /* Set the current floating-point control/status word */ + asm volatile ("LDMXCSR %0" : : "m" (sw)); +#else +#error Unknown machine +#endif +} +#endif /* USE_SET_FPSW_INLINE */ + +#if defined(USE_CLEAR_FPSW_INLINE) +/* Clear all exceptions from the floating-point status word */ +static inline void clear_fpsw_inline(void) +{ +#if defined(WINDOWS) + unsigned int cw; + cw = _mm_getcsr(); + cw &= ~(AMD_F_INEXACT | AMD_F_UNDERFLOW | AMD_F_OVERFLOW | + AMD_F_DIVBYZERO | AMD_F_INVALID); + _mm_setcsr(cw); +#elif defined(linux) + unsigned int cw; + /* Get the current floating-point control/status word */ + asm volatile ("STMXCSR %0" : "=m" (cw)); + cw &= ~(AMD_F_INEXACT | AMD_F_UNDERFLOW | AMD_F_OVERFLOW | + AMD_F_DIVBYZERO | AMD_F_INVALID); + asm volatile ("LDMXCSR %0" : : "m" (cw)); +#else +#error Unknown machine +#endif +} +#endif /* USE_CLEAR_FPSW_INLINE */ + + +#if defined(USE_VAL_WITH_FLAGS) +/* Returns a double value after raising the given flags, + e.g. val_with_flags(x, AMD_F_INEXACT); + */ +static inline double val_with_flags(double val, int flags) +{ + raise_fpsw_flags(flags); + return val; +} +#endif /* USE_VAL_WITH_FLAGS */ + +#if defined(USE_VALF_WITH_FLAGS) +/* Returns a float value after raising the given flags, + e.g. valf_with_flags(x, AMD_F_INEXACT); + */ +static inline float valf_with_flags(float val, int flags) +{ + raise_fpsw_flags(flags); + return val; +} +#endif /* USE_VALF_WITH_FLAGS */ + + +#if defined(USE_ZERO_WITH_FLAGS) +/* Returns a double +zero after raising the given flags, + e.g. zero_with_flags(AMD_F_INEXACT | AMD_F_INVALID); + */ +static inline double zero_with_flags(int flags) +{ + raise_fpsw_flags(flags); + return 0.0; +} +#endif /* USE_ZERO_WITH_FLAGS */ + + +#if defined(USE_ZEROF_WITH_FLAGS) +/* Returns a float +zero after raising the given flags, + e.g. zerof_with_flags(AMD_F_INEXACT | AMD_F_INVALID); + */ +static inline float zerof_with_flags(int flags) +{ + raise_fpsw_flags(flags); + return 0.0F; +} +#endif /* USE_ZEROF_WITH_FLAGS */ + + +#if defined(USE_NAN_WITH_FLAGS) +/* Returns a double quiet +nan after raising the given flags, + e.g. nan_with_flags(AMD_F_INVALID); +*/ +static inline double nan_with_flags(int flags) +{ + double z; + raise_fpsw_flags(flags); + PUT_BITS_DP64(0x7ff8000000000000, z); + return z; +} +#endif /* USE_NAN_WITH_FLAGS */ + +#if defined(USE_NANF_WITH_FLAGS) +/* Returns a float quiet +nan after raising the given flags, + e.g. 
nanf_with_flags(AMD_F_INVALID); +*/ +static inline float nanf_with_flags(int flags) +{ + float z; + raise_fpsw_flags(flags); + PUT_BITS_SP32(0x7fc00000, z); + return z; +} +#endif /* USE_NANF_WITH_FLAGS */ + + +#if defined(USE_INDEFINITE_WITH_FLAGS) +/* Returns a double indefinite after raising the given flags, + e.g. indefinite_with_flags(AMD_F_INVALID); +*/ +static inline double indefinite_with_flags(int flags) +{ + double z; + raise_fpsw_flags(flags); + PUT_BITS_DP64(0xfff8000000000000, z); + return z; +} +#endif /* USE_INDEFINITE_WITH_FLAGS */ + +#if defined(USE_INDEFINITEF_WITH_FLAGS) +/* Returns a float quiet +indefinite after raising the given flags, + e.g. indefinitef_with_flags(AMD_F_INVALID); +*/ +static inline float indefinitef_with_flags(int flags) +{ + float z; + raise_fpsw_flags(flags); + PUT_BITS_SP32(0xffc00000, z); + return z; +} +#endif /* USE_INDEFINITEF_WITH_FLAGS */ + + +#ifdef USE_INFINITY_WITH_FLAGS +/* Returns a positive double infinity after raising the given flags, + e.g. infinity_with_flags(AMD_F_OVERFLOW); +*/ +static inline double infinity_with_flags(int flags) +{ + double z; + raise_fpsw_flags(flags); + PUT_BITS_DP64((unsigned long long)(BIASEDEMAX_DP64 + 1) << EXPSHIFTBITS_DP64, z); + return z; +} +#endif /* USE_INFINITY_WITH_FLAGS */ + +#ifdef USE_INFINITYF_WITH_FLAGS +/* Returns a positive float infinity after raising the given flags, + e.g. infinityf_with_flags(AMD_F_OVERFLOW); +*/ +static inline float infinityf_with_flags(int flags) +{ + float z; + raise_fpsw_flags(flags); + PUT_BITS_SP32((BIASEDEMAX_SP32 + 1) << EXPSHIFTBITS_SP32, z); + return z; +} +#endif /* USE_INFINITYF_WITH_FLAGS */ + + +#if defined(USE_SPLITEXP) +/* Compute the values m, z1, and z2 such that base**x = 2**m * (z1 + z2). + Small arguments abs(x) < 1/(16*ln(base)) and extreme arguments + abs(x) > large/(ln(base)) (where large is the largest representable + floating point number) should be handled separately instead of calling + this function. This function is called by exp_amd, exp2_amd, exp10_amd, + cosh_amd and sinh_amd. */ +static inline void splitexp(double x, double logbase, + double thirtytwo_by_logbaseof2, + double logbaseof2_by_32_lead, + double logbaseof2_by_32_trail, + int *m, double *z1, double *z2) +{ + double q, r, r1, r2, f1, f2; + int n, j; + +/* Arrays two_to_jby32_lead_table and two_to_jby32_trail_table contain + leading and trailing parts respectively of precomputed + values of pow(2.0,j/32.0), for j = 0, 1, ..., 31. + two_to_jby32_lead_table contains the first 25 bits of precision, + and two_to_jby32_trail_table contains a further 53 bits precision. 
*/ + + static const double two_to_jby32_lead_table[32] = { + 1.00000000000000000000e+00, /* 0x3ff0000000000000 */ + 1.02189713716506958008e+00, /* 0x3ff059b0d0000000 */ + 1.04427373409271240234e+00, /* 0x3ff0b55860000000 */ + 1.06714040040969848633e+00, /* 0x3ff11301d0000000 */ + 1.09050768613815307617e+00, /* 0x3ff172b830000000 */ + 1.11438673734664916992e+00, /* 0x3ff1d48730000000 */ + 1.13878858089447021484e+00, /* 0x3ff2387a60000000 */ + 1.16372483968734741211e+00, /* 0x3ff29e9df0000000 */ + 1.18920707702636718750e+00, /* 0x3ff306fe00000000 */ + 1.21524733304977416992e+00, /* 0x3ff371a730000000 */ + 1.24185776710510253906e+00, /* 0x3ff3dea640000000 */ + 1.26905095577239990234e+00, /* 0x3ff44e0860000000 */ + 1.29683953523635864258e+00, /* 0x3ff4bfdad0000000 */ + 1.32523661851882934570e+00, /* 0x3ff5342b50000000 */ + 1.35425549745559692383e+00, /* 0x3ff5ab07d0000000 */ + 1.38390988111495971680e+00, /* 0x3ff6247eb0000000 */ + 1.41421353816986083984e+00, /* 0x3ff6a09e60000000 */ + 1.44518077373504638672e+00, /* 0x3ff71f75e0000000 */ + 1.47682613134384155273e+00, /* 0x3ff7a11470000000 */ + 1.50916439294815063477e+00, /* 0x3ff8258990000000 */ + 1.54221081733703613281e+00, /* 0x3ff8ace540000000 */ + 1.57598084211349487305e+00, /* 0x3ff93737b0000000 */ + 1.61049032211303710938e+00, /* 0x3ff9c49180000000 */ + 1.64575546979904174805e+00, /* 0x3ffa5503b0000000 */ + 1.68179279565811157227e+00, /* 0x3ffae89f90000000 */ + 1.71861928701400756836e+00, /* 0x3ffb7f76f0000000 */ + 1.75625211000442504883e+00, /* 0x3ffc199bd0000000 */ + 1.79470902681350708008e+00, /* 0x3ffcb720d0000000 */ + 1.83400803804397583008e+00, /* 0x3ffd5818d0000000 */ + 1.87416762113571166992e+00, /* 0x3ffdfc9730000000 */ + 1.91520655155181884766e+00, /* 0x3ffea4afa0000000 */ + 1.95714408159255981445e+00}; /* 0x3fff507650000000 */ + + static const double two_to_jby32_trail_table[32] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.14890470981563546737e-08, /* 0x3e48ac2ba1d73e2a */ + 4.83347014379782142328e-08, /* 0x3e69f3121ec53172 */ + 2.67125131841396124714e-10, /* 0x3df25b50a4ebbf1b */ + 4.65271045830351350190e-08, /* 0x3e68faa2f5b9bef9 */ + 5.24924336638693782574e-09, /* 0x3e368b9aa7805b80 */ + 5.38622214388600821910e-08, /* 0x3e6ceac470cd83f6 */ + 1.90902301017041969782e-08, /* 0x3e547f7b84b09745 */ + 3.79763538792174980894e-08, /* 0x3e64636e2a5bd1ab */ + 2.69306947081946450986e-08, /* 0x3e5ceaa72a9c5154 */ + 4.49683815095311756138e-08, /* 0x3e682468446b6824 */ + 1.41933332021066904914e-09, /* 0x3e18624b40c4dbd0 */ + 1.94146510233556266402e-08, /* 0x3e54d8a89c750e5e */ + 2.46409119489264118569e-08, /* 0x3e5a753e077c2a0f */ + 4.94812958044698886494e-08, /* 0x3e6a90a852b19260 */ + 8.48872238075784476136e-10, /* 0x3e0d2ac258f87d03 */ + 2.42032342089579394887e-08, /* 0x3e59fcef32422cbf */ + 3.32420002333182569170e-08, /* 0x3e61d8bee7ba46e2 */ + 1.45956577586525322754e-08, /* 0x3e4f580c36bea881 */ + 3.46452721050003920866e-08, /* 0x3e62999c25159f11 */ + 8.07090469079979051284e-09, /* 0x3e415506dadd3e2a */ + 2.99439161340839520436e-09, /* 0x3e29b8bc9e8a0388 */ + 9.83621719880452147153e-09, /* 0x3e451f8480e3e236 */ + 8.35492309647188080486e-09, /* 0x3e41f12ae45a1224 */ + 3.48493175137966283582e-08, /* 0x3e62b5a75abd0e6a */ + 1.11084703472699692902e-08, /* 0x3e47daf237553d84 */ + 5.03688744342840346564e-08, /* 0x3e6b0aa538444196 */ + 4.81896001063495806249e-08, /* 0x3e69df20d22a0798 */ + 4.83653666334089557746e-08, /* 0x3e69f7490e4bb40b */ + 1.29745882314081237628e-08, /* 0x3e4bdcdaf5cb4656 */ + 
9.84532844621636118964e-09, /* 0x3e452486cc2c7b9d */ + 4.25828404545651943883e-08}; /* 0x3e66dc8a80ce9f09 */ + + /* + Step 1. Reduce the argument. + + To perform argument reduction, we find the integer n such that + x = n * logbaseof2/32 + remainder, |remainder| <= logbaseof2/64. + n is defined by round-to-nearest-integer( x*32/logbaseof2 ) and + remainder by x - n*logbaseof2/32. The calculation of n is + straightforward whereas the computation of x - n*logbaseof2/32 + must be carried out carefully. + logbaseof2/32 is so represented in two pieces that + (1) logbaseof2/32 is known to extra precision, (2) the product + of n and the leading piece is a model number and is hence + calculated without error, and (3) the subtraction of the value + obtained in (2) from x is a model number and is hence again + obtained without error. + */ + + r = x * thirtytwo_by_logbaseof2; + /* Set n = nearest integer to r */ + /* This is faster on Hammer */ + if (r > 0) + n = (int)(r + 0.5); + else + n = (int)(r - 0.5); + + r1 = x - n * logbaseof2_by_32_lead; + r2 = - n * logbaseof2_by_32_trail; + + /* Set j = n mod 32: 5 mod 32 = 5, -5 mod 32 = 27, etc. */ + /* j = n % 32; + if (j < 0) j += 32; */ + j = n & 0x0000001f; + + f1 = two_to_jby32_lead_table[j]; + f2 = two_to_jby32_trail_table[j]; + + *m = (n - j) / 32; + + /* Step 2. The following is the core approximation. We approximate + exp(r1+r2)-1 by a polynomial. */ + + r1 *= logbase; r2 *= logbase; + + r = r1 + r2; + q = r1 + (r2 + + r*r*( 5.00000000000000008883e-01 + + r*( 1.66666666665260878863e-01 + + r*( 4.16666666662260795726e-02 + + r*( 8.33336798434219616221e-03 + + r*( 1.38889490863777199667e-03 )))))); + + /* Step 3. Function value reconstruction. + We now reconstruct the exponential of the input argument + so that exp(x) = 2**m * (z1 + z2). + The order of the computation below must be strictly observed. */ + + *z1 = f1; + *z2 = f2 + ((f1 + f2) * q); +} +#endif /* USE_SPLITEXP */ + + +#if defined(USE_SPLITEXPF) +/* Compute the values m, z1, and z2 such that base**x = 2**m * (z1 + z2). + Small arguments abs(x) < 1/(16*ln(base)) and extreme arguments + abs(x) > large/(ln(base)) (where large is the largest representable + floating point number) should be handled separately instead of calling + this function. This function is called by exp_amd, exp2_amd, exp10_amd, + cosh_amd and sinh_amd. */ +static inline void splitexpf(float x, float logbase, + float thirtytwo_by_logbaseof2, + float logbaseof2_by_32_lead, + float logbaseof2_by_32_trail, + int *m, float *z1, float *z2) +{ + float q, r, r1, r2, f1, f2; + int n, j; + +/* Arrays two_to_jby32_lead_table and two_to_jby32_trail_table contain + leading and trailing parts respectively of precomputed + values of pow(2.0,j/32.0), for j = 0, 1, ..., 31. + two_to_jby32_lead_table contains the first 10 bits of precision, + and two_to_jby32_trail_table contains a further 24 bits precision. 
*/ + + static const float two_to_jby32_lead_table[32] = { + 1.0000000000E+00F, /* 0x3F800000 */ + 1.0214843750E+00F, /* 0x3F82C000 */ + 1.0429687500E+00F, /* 0x3F858000 */ + 1.0664062500E+00F, /* 0x3F888000 */ + 1.0898437500E+00F, /* 0x3F8B8000 */ + 1.1132812500E+00F, /* 0x3F8E8000 */ + 1.1386718750E+00F, /* 0x3F91C000 */ + 1.1621093750E+00F, /* 0x3F94C000 */ + 1.1875000000E+00F, /* 0x3F980000 */ + 1.2148437500E+00F, /* 0x3F9B8000 */ + 1.2402343750E+00F, /* 0x3F9EC000 */ + 1.2675781250E+00F, /* 0x3FA24000 */ + 1.2949218750E+00F, /* 0x3FA5C000 */ + 1.3242187500E+00F, /* 0x3FA98000 */ + 1.3535156250E+00F, /* 0x3FAD4000 */ + 1.3828125000E+00F, /* 0x3FB10000 */ + 1.4140625000E+00F, /* 0x3FB50000 */ + 1.4433593750E+00F, /* 0x3FB8C000 */ + 1.4765625000E+00F, /* 0x3FBD0000 */ + 1.5078125000E+00F, /* 0x3FC10000 */ + 1.5410156250E+00F, /* 0x3FC54000 */ + 1.5742187500E+00F, /* 0x3FC98000 */ + 1.6093750000E+00F, /* 0x3FCE0000 */ + 1.6445312500E+00F, /* 0x3FD28000 */ + 1.6816406250E+00F, /* 0x3FD74000 */ + 1.7167968750E+00F, /* 0x3FDBC000 */ + 1.7558593750E+00F, /* 0x3FE0C000 */ + 1.7929687500E+00F, /* 0x3FE58000 */ + 1.8339843750E+00F, /* 0x3FEAC000 */ + 1.8730468750E+00F, /* 0x3FEFC000 */ + 1.9140625000E+00F, /* 0x3FF50000 */ + 1.9570312500E+00F}; /* 0x3FFA8000 */ + + static const float two_to_jby32_trail_table[32] = { + 0.0000000000E+00F, /* 0x00000000 */ + 4.1277357377E-04F, /* 0x39D86988 */ + 1.3050324051E-03F, /* 0x3AAB0D9F */ + 7.3415064253E-04F, /* 0x3A407404 */ + 6.6398258787E-04F, /* 0x3A2E0F1E */ + 1.1054925853E-03F, /* 0x3A90E62D */ + 1.1675967835E-04F, /* 0x38F4DCE0 */ + 1.6154836630E-03F, /* 0x3AD3BEA3 */ + 1.7071149778E-03F, /* 0x3ADFC146 */ + 4.0360994171E-04F, /* 0x39D39B9C */ + 1.6234370414E-03F, /* 0x3AD4C982 */ + 1.4728321694E-03F, /* 0x3AC10C0C */ + 1.9176795613E-03F, /* 0x3AFB5AA6 */ + 1.0178930825E-03F, /* 0x3A856AD3 */ + 7.3992193211E-04F, /* 0x3A41F752 */ + 1.0973819299E-03F, /* 0x3A8FD607 */ + 1.5106226783E-04F, /* 0x391E6678 */ + 1.8214319134E-03F, /* 0x3AEEBD1D */ + 2.6364589576E-04F, /* 0x398A39F4 */ + 1.3519275235E-03F, /* 0x3AB13329 */ + 1.1952003697E-03F, /* 0x3A9CA845 */ + 1.7620950239E-03F, /* 0x3AE6F619 */ + 1.1153318919E-03F, /* 0x3A923054 */ + 1.2242280645E-03F, /* 0x3AA07647 */ + 1.5220546629E-04F, /* 0x391F9958 */ + 1.8224230735E-03F, /* 0x3AEEDE5F */ + 3.9278529584E-04F, /* 0x39CDEEC0 */ + 1.7403248930E-03F, /* 0x3AE41B9D */ + 2.3711356334E-05F, /* 0x37C6E7C0 */ + 1.1207590578E-03F, /* 0x3A92E66F */ + 1.1440613307E-03F, /* 0x3A95F454 */ + 1.1287408415E-04F}; /* 0x38ECB6D0 */ + + /* + Step 1. Reduce the argument. + + To perform argument reduction, we find the integer n such that + x = n * logbaseof2/32 + remainder, |remainder| <= logbaseof2/64. + n is defined by round-to-nearest-integer( x*32/logbaseof2 ) and + remainder by x - n*logbaseof2/32. The calculation of n is + straightforward whereas the computation of x - n*logbaseof2/32 + must be carried out carefully. + logbaseof2/32 is so represented in two pieces that + (1) logbaseof2/32 is known to extra precision, (2) the product + of n and the leading piece is a model number and is hence + calculated without error, and (3) the subtraction of the value + obtained in (2) from x is a model number and is hence again + obtained without error. 
+ */ + + r = x * thirtytwo_by_logbaseof2; + /* Set n = nearest integer to r */ + /* This is faster on Hammer */ + if (r > 0) + n = (int)(r + 0.5F); + else + n = (int)(r - 0.5F); + + r1 = x - n * logbaseof2_by_32_lead; + r2 = - n * logbaseof2_by_32_trail; + + /* Set j = n mod 32: 5 mod 32 = 5, -5 mod 32 = 27, etc. */ + /* j = n % 32; + if (j < 0) j += 32; */ + j = n & 0x0000001f; + + f1 = two_to_jby32_lead_table[j]; + f2 = two_to_jby32_trail_table[j]; + + *m = (n - j) / 32; + + /* Step 2. The following is the core approximation. We approximate + exp(r1+r2)-1 by a polynomial. */ + + r1 *= logbase; r2 *= logbase; + + r = r1 + r2; + q = r1 + (r2 + + r*r*( 5.00000000000000008883e-01F + + r*( 1.66666666665260878863e-01F ))); + + /* Step 3. Function value reconstruction. + We now reconstruct the exponential of the input argument + so that exp(x) = 2**m * (z1 + z2). + The order of the computation below must be strictly observed. */ + + *z1 = f1; + *z2 = f2 + ((f1 + f2) * q); +} +#endif /* SPLITEXPF */ + + +#if defined(USE_SCALEUPDOUBLE1024) +/* Scales up a double (normal or denormal) whose bit pattern is given + as ux by 2**1024. There are no checks that the input number is + scalable by that amount. */ +static inline void scaleUpDouble1024(unsigned long long ux, unsigned long long *ur) +{ + unsigned long long uy; + double y; + + if ((ux & EXPBITS_DP64) == 0) + { + /* ux is denormalised */ + PUT_BITS_DP64(ux | 0x4010000000000000, y); + if (ux & SIGNBIT_DP64) + y += 4.0; + else + y -= 4.0; + GET_BITS_DP64(y, uy); + } + else + /* ux is normal */ + uy = ux + 0x4000000000000000; + + *ur = uy; + return; +} + +#endif /* SCALEUPDOUBLE1024 */ + + +#if defined(USE_SCALEDOWNDOUBLE) +/* Scales down a double whose bit pattern is given as ux by 2**k. + There are no checks that the input number is scalable by that amount. */ +static inline void scaleDownDouble(unsigned long long ux, int k, + unsigned long long *ur) +{ + unsigned long long uy, uk, ax, xsign; + int n, shift; + xsign = ux & SIGNBIT_DP64; + ax = ux & ~SIGNBIT_DP64; + n = (int)((ax & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - k; + if (n > 0) + { + uk = (unsigned long long)n << EXPSHIFTBITS_DP64; + uy = (ax & ~EXPBITS_DP64) | uk; + } + else + { + uy = (ax & ~EXPBITS_DP64) | 0x0010000000000000; + shift = (1 - n); + if (shift > MANTLENGTH_DP64 + 1) + /* Sigh. Shifting works mod 64 so be careful not to shift too much */ + uy = 0; + else + { + /* Make sure we round the result */ + uy >>= shift - 1; + uy = (uy >> 1) + (uy & 1); + } + } + *ur = uy | xsign; +} + +#endif /* SCALEDOWNDOUBLE */ + + +#if defined(USE_SCALEUPFLOAT128) +/* Scales up a float (normal or denormal) whose bit pattern is given + as ux by 2**128. There are no checks that the input number is + scalable by that amount. */ +static inline void scaleUpFloat128(unsigned int ux, unsigned int *ur) +{ + unsigned int uy; + float y; + + if ((ux & EXPBITS_SP32) == 0) + { + /* ux is denormalised */ + PUT_BITS_SP32(ux | 0x40800000, y); + /* Compensate for the implicit bit just added */ + if (ux & SIGNBIT_SP32) + y += 4.0F; + else + y -= 4.0F; + GET_BITS_SP32(y, uy); + } + else + /* ux is normal */ + uy = ux + 0x40000000; + *ur = uy; +} +#endif /* SCALEUPFLOAT128 */ + + +#if defined(USE_SCALEDOWNFLOAT) +/* Scales down a float whose bit pattern is given as ux by 2**k. + There are no checks that the input number is scalable by that amount. 
*/ +static inline void scaleDownFloat(unsigned int ux, int k, + unsigned int *ur) +{ + unsigned int uy, uk, ax, xsign; + int n, shift; + + xsign = ux & SIGNBIT_SP32; + ax = ux & ~SIGNBIT_SP32; + n = ((ax & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - k; + if (n > 0) + { + uk = (unsigned int)n << EXPSHIFTBITS_SP32; + uy = (ax & ~EXPBITS_SP32) | uk; + } + else + { + uy = (ax & ~EXPBITS_SP32) | 0x00800000; + shift = (1 - n); + if (shift > MANTLENGTH_SP32 + 1) + /* Sigh. Shifting works mod 32 so be careful not to shift too much */ + uy = 0; + else + { + /* Make sure we round the result */ + uy >>= shift - 1; + uy = (uy >> 1) + (uy & 1); + } + } + *ur = uy | xsign; +} +#endif /* SCALEDOWNFLOAT */ + + +#if defined(USE_SQRT_AMD_INLINE) +static inline double sqrt_amd_inline(double x) +{ + /* + Computes the square root of x. + + The calculation is carried out in three steps. + + Step 1. Reduction. + The input argument is scaled to the interval [1, 4) by + computing + x = 2^e * y, where y in [1,4). + Furthermore y is decomposed as y = c + t where + c = 1 + j/32, j = 0,1,..,96; and |t| <= 1/64. + + Step 2. Approximation. + An approximation q = sqrt(1 + (t/c)) - 1 is obtained + from a basic series expansion using precomputed values + stored in rt_jby32_lead_table_dbl and rt_jby32_trail_table_dbl. + + Step 3. Reconstruction. + The value of sqrt(x) is reconstructed via + sqrt(x) = 2^(e/2) * sqrt(y) + = 2^(e/2) * sqrt(c) * sqrt(y/c) + = 2^(e/2) * sqrt(c) * sqrt(1 + t/c) + = 2^(e/2) * [ sqrt(c) + sqrt(c)*q ] + */ + + unsigned long long ux, ax, u; + double r1, r2, c, y, p, q, r, twop, z, rtc, rtc_lead, rtc_trail; + int e, denorm = 0, index; + +/* Arrays rt_jby32_lead_table_dbl and rt_jby32_trail_table_dbl contain + leading and trailing parts respectively of precomputed + values of sqrt(j/32), for j = 32, 33, ..., 128. + rt_jby32_lead_table_dbl contains the first 21 bits of precision, + and rt_jby32_trail_table_dbl contains a further 53 bits precision. 
*/ + + static const double rt_jby32_lead_table_dbl[97] = { + 1.00000000000000000000e+00, /* 0x3ff0000000000000 */ + 1.01550388336181640625e+00, /* 0x3ff03f8100000000 */ + 1.03077602386474609375e+00, /* 0x3ff07e0f00000000 */ + 1.04582500457763671875e+00, /* 0x3ff0bbb300000000 */ + 1.06065940856933593750e+00, /* 0x3ff0f87600000000 */ + 1.07528972625732421875e+00, /* 0x3ff1346300000000 */ + 1.08972454071044921875e+00, /* 0x3ff16f8300000000 */ + 1.10396957397460937500e+00, /* 0x3ff1a9dc00000000 */ + 1.11803340911865234375e+00, /* 0x3ff1e37700000000 */ + 1.13192272186279296875e+00, /* 0x3ff21c5b00000000 */ + 1.14564323425292968750e+00, /* 0x3ff2548e00000000 */ + 1.15920162200927734375e+00, /* 0x3ff28c1700000000 */ + 1.17260360717773437500e+00, /* 0x3ff2c2fc00000000 */ + 1.18585395812988281250e+00, /* 0x3ff2f94200000000 */ + 1.19895744323730468750e+00, /* 0x3ff32eee00000000 */ + 1.21191978454589843750e+00, /* 0x3ff3640600000000 */ + 1.22474479675292968750e+00, /* 0x3ff3988e00000000 */ + 1.23743629455566406250e+00, /* 0x3ff3cc8a00000000 */ + 1.25000000000000000000e+00, /* 0x3ff4000000000000 */ + 1.26243782043457031250e+00, /* 0x3ff432f200000000 */ + 1.27475452423095703125e+00, /* 0x3ff4656500000000 */ + 1.28695297241210937500e+00, /* 0x3ff4975c00000000 */ + 1.29903793334960937500e+00, /* 0x3ff4c8dc00000000 */ + 1.31101036071777343750e+00, /* 0x3ff4f9e600000000 */ + 1.32287502288818359375e+00, /* 0x3ff52a7f00000000 */ + 1.33463478088378906250e+00, /* 0x3ff55aaa00000000 */ + 1.34629058837890625000e+00, /* 0x3ff58a6800000000 */ + 1.35784721374511718750e+00, /* 0x3ff5b9be00000000 */ + 1.36930561065673828125e+00, /* 0x3ff5e8ad00000000 */ + 1.38066959381103515625e+00, /* 0x3ff6173900000000 */ + 1.39194107055664062500e+00, /* 0x3ff6456400000000 */ + 1.40312099456787109375e+00, /* 0x3ff6732f00000000 */ + 1.41421318054199218750e+00, /* 0x3ff6a09e00000000 */ + 1.42521858215332031250e+00, /* 0x3ff6cdb200000000 */ + 1.43614006042480468750e+00, /* 0x3ff6fa6e00000000 */ + 1.44697952270507812500e+00, /* 0x3ff726d400000000 */ + 1.45773792266845703125e+00, /* 0x3ff752e500000000 */ + 1.46841716766357421875e+00, /* 0x3ff77ea300000000 */ + 1.47901916503906250000e+00, /* 0x3ff7aa1000000000 */ + 1.48954677581787109375e+00, /* 0x3ff7d52f00000000 */ + 1.50000000000000000000e+00, /* 0x3ff8000000000000 */ + 1.51038074493408203125e+00, /* 0x3ff82a8500000000 */ + 1.52068996429443359375e+00, /* 0x3ff854bf00000000 */ + 1.53093051910400390625e+00, /* 0x3ff87eb100000000 */ + 1.54110336303710937500e+00, /* 0x3ff8a85c00000000 */ + 1.55120849609375000000e+00, /* 0x3ff8d1c000000000 */ + 1.56124877929687500000e+00, /* 0x3ff8fae000000000 */ + 1.57122516632080078125e+00, /* 0x3ff923bd00000000 */ + 1.58113861083984375000e+00, /* 0x3ff94c5800000000 */ + 1.59099006652832031250e+00, /* 0x3ff974b200000000 */ + 1.60078048706054687500e+00, /* 0x3ff99ccc00000000 */ + 1.61051177978515625000e+00, /* 0x3ff9c4a800000000 */ + 1.62018489837646484375e+00, /* 0x3ff9ec4700000000 */ + 1.62979984283447265625e+00, /* 0x3ffa13a900000000 */ + 1.63935947418212890625e+00, /* 0x3ffa3ad100000000 */ + 1.64886283874511718750e+00, /* 0x3ffa61be00000000 */ + 1.65831184387207031250e+00, /* 0x3ffa887200000000 */ + 1.66770744323730468750e+00, /* 0x3ffaaeee00000000 */ + 1.67705059051513671875e+00, /* 0x3ffad53300000000 */ + 1.68634128570556640625e+00, /* 0x3ffafb4100000000 */ + 1.69558238983154296875e+00, /* 0x3ffb211b00000000 */ + 1.70477199554443359375e+00, /* 0x3ffb46bf00000000 */ + 1.71391296386718750000e+00, /* 0x3ffb6c3000000000 */ + 1.72300529479980468750e+00, 
/* 0x3ffb916e00000000 */ + 1.73204994201660156250e+00, /* 0x3ffbb67a00000000 */ + 1.74104785919189453125e+00, /* 0x3ffbdb5500000000 */ + 1.75000000000000000000e+00, /* 0x3ffc000000000000 */ + 1.75890541076660156250e+00, /* 0x3ffc247a00000000 */ + 1.76776695251464843750e+00, /* 0x3ffc48c600000000 */ + 1.77658367156982421875e+00, /* 0x3ffc6ce300000000 */ + 1.78535652160644531250e+00, /* 0x3ffc90d200000000 */ + 1.79408740997314453125e+00, /* 0x3ffcb49500000000 */ + 1.80277538299560546875e+00, /* 0x3ffcd82b00000000 */ + 1.81142139434814453125e+00, /* 0x3ffcfb9500000000 */ + 1.82002735137939453125e+00, /* 0x3ffd1ed500000000 */ + 1.82859230041503906250e+00, /* 0x3ffd41ea00000000 */ + 1.83711719512939453125e+00, /* 0x3ffd64d500000000 */ + 1.84560203552246093750e+00, /* 0x3ffd879600000000 */ + 1.85404872894287109375e+00, /* 0x3ffdaa2f00000000 */ + 1.86245727539062500000e+00, /* 0x3ffdcca000000000 */ + 1.87082862854003906250e+00, /* 0x3ffdeeea00000000 */ + 1.87916183471679687500e+00, /* 0x3ffe110c00000000 */ + 1.88745784759521484375e+00, /* 0x3ffe330700000000 */ + 1.89571857452392578125e+00, /* 0x3ffe54dd00000000 */ + 1.90394306182861328125e+00, /* 0x3ffe768d00000000 */ + 1.91213226318359375000e+00, /* 0x3ffe981800000000 */ + 1.92028617858886718750e+00, /* 0x3ffeb97e00000000 */ + 1.92840576171875000000e+00, /* 0x3ffedac000000000 */ + 1.93649101257324218750e+00, /* 0x3ffefbde00000000 */ + 1.94454288482666015625e+00, /* 0x3fff1cd900000000 */ + 1.95256233215332031250e+00, /* 0x3fff3db200000000 */ + 1.96054744720458984375e+00, /* 0x3fff5e6700000000 */ + 1.96850109100341796875e+00, /* 0x3fff7efb00000000 */ + 1.97642326354980468750e+00, /* 0x3fff9f6e00000000 */ + 1.98431301116943359375e+00, /* 0x3fffbfbf00000000 */ + 1.99217128753662109375e+00, /* 0x3fffdfef00000000 */ + 2.00000000000000000000e+00}; /* 0x4000000000000000 */ + + static const double rt_jby32_trail_table_dbl[97] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 9.17217678638807524014e-07, /* 0x3eaec6d70177881c */ + 3.82539669043705364790e-07, /* 0x3e99abfb41bd6b24 */ + 2.85899577162227138140e-08, /* 0x3e5eb2bf6bab55a2 */ + 7.63210485349101216659e-07, /* 0x3ea99bed9b2d8d0c */ + 9.32123004127716212874e-07, /* 0x3eaf46e029c1b296 */ + 1.95174719169309219157e-07, /* 0x3e8a3226fc42f30c */ + 5.34316371481845492427e-07, /* 0x3ea1edbe20701d73 */ + 5.79631242504454563052e-07, /* 0x3ea372fe94f82be7 */ + 4.20404384109571705948e-07, /* 0x3e9c367e08e7bb06 */ + 6.89486030314147010716e-07, /* 0x3ea722a3d0a66608 */ + 6.89927685625314560328e-07, /* 0x3ea7266f067ca1d6 */ + 3.32778123013641425828e-07, /* 0x3e965515a9b34850 */ + 1.64433259436999584387e-07, /* 0x3e8611e23ef6c1bd */ + 4.37590875197899335723e-07, /* 0x3e9d5dc1059ed8e7 */ + 1.79808183816018617413e-07, /* 0x3e88222982d0e4f4 */ + 7.46386593615986477624e-08, /* 0x3e7409212e7d0322 */ + 5.72520794105201454728e-07, /* 0x3ea335ea8a5fcf39 */ + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 2.96860689431670420344e-07, /* 0x3e93ec071e938bfe */ + 3.54167239176257065345e-07, /* 0x3e97c48bfd9862c6 */ + 7.95211265664474710063e-07, /* 0x3eaaaed010f74671 */ + 1.72327048595145565621e-07, /* 0x3e87211cbfeb62e0 */ + 6.99494915996239297020e-07, /* 0x3ea7789d9660e72d */ + 6.32644111701500844315e-07, /* 0x3ea53a5f1d36f1cf */ + 6.20124838851440463844e-10, /* 0x3e054eacff2057dc */ + 6.13404719757812629969e-07, /* 0x3ea4951b3e6a83cc */ + 3.47654909777986407387e-07, /* 0x3e9754aa76884c66 */ + 7.83106177002392475763e-07, /* 0x3eaa46d4b1de1074 */ + 5.33337372440526357008e-07, /* 0x3ea1e55548f92635 */ + 
2.01508648555298681765e-08, /* 0x3e55a3070dd17788 */ + 5.25472356925843939587e-07, /* 0x3ea1a1c5eedb0801 */ + 3.81831102861301692797e-07, /* 0x3e999fcef32422cc */ + 6.99220602161420018738e-07, /* 0x3ea776425d6b0199 */ + 6.01209702477462624811e-07, /* 0x3ea42c5a1e0191a2 */ + 9.01437000591944740554e-08, /* 0x3e7832a0bdff1327 */ + 5.10428680864685379950e-08, /* 0x3e6b674743636676 */ + 3.47895267104621031421e-07, /* 0x3e9758cb90d2f714 */ + 7.80735841510641848628e-07, /* 0x3eaa3278459cde25 */ + 1.35158752025506517690e-07, /* 0x3e822404f4a103ee */ + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.76523947728535489812e-09, /* 0x3e1e539af6892ac5 */ + 6.68280121328499932183e-07, /* 0x3ea66c7b872c9cd0 */ + 5.70135482405123276616e-07, /* 0x3ea3216d2f43887d */ + 1.37705134737562525897e-07, /* 0x3e827b832cbedc0e */ + 7.09655107074516613672e-07, /* 0x3ea7cfe41579091d */ + 7.20302724551461693011e-07, /* 0x3ea82b5a713c490a */ + 4.69926266058212796694e-07, /* 0x3e9f8945932d872e */ + 2.19244345915999437026e-07, /* 0x3e8d6d2da9490251 */ + 1.91141411617401877927e-07, /* 0x3e89a791a3114e4a */ + 5.72297665296622053774e-07, /* 0x3ea333ffe005988d */ + 5.61055484436830560103e-07, /* 0x3ea2d36e0ed49ab1 */ + 2.76225500213991506100e-07, /* 0x3e92898498f55f9e */ + 7.58466189522395692908e-07, /* 0x3ea9732cca1032a3 */ + 1.56893371256836029827e-07, /* 0x3e850ed0b02a22d2 */ + 4.06038997708867066507e-07, /* 0x3e9b3fb265b1e40a */ + 5.51305629612057435809e-07, /* 0x3ea27fade682d1de */ + 5.64778487026561123207e-07, /* 0x3ea2f36906f707ba */ + 3.92609705553556897517e-07, /* 0x3e9a58fbbee883b6 */ + 9.09698438776943827802e-07, /* 0x3eae864005bca6d7 */ + 1.05949774066016139743e-07, /* 0x3e7c70d02300f263 */ + 7.16578798392844784244e-07, /* 0x3ea80b5d712d8e3e */ + 6.86233073531233972561e-07, /* 0x3ea706b27cc7d390 */ + 7.99211473033494452908e-07, /* 0x3eaad12c9d849a97 */ + 8.65552275731027456121e-07, /* 0x3ead0b09954e764b */ + 6.75456120386058448618e-07, /* 0x3ea6aa1fb7826cbd */ + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 4.99167184520462138743e-07, /* 0x3ea0bfd03f46763c */ + 4.51720373502110930296e-10, /* 0x3dff0abfb4adfb9e */ + 1.28874162718371367439e-07, /* 0x3e814c151f991b2e */ + 5.85529267186999798656e-07, /* 0x3ea3a5a879b09292 */ + 1.01827770937125531924e-07, /* 0x3e7b558d173f9796 */ + 2.54736389177809626508e-07, /* 0x3e9118567cd83fb8 */ + 6.98925535290464831294e-07, /* 0x3ea773b981896751 */ + 1.20940735036524314513e-07, /* 0x3e803b7df49f48a8 */ + 5.43759351196479689657e-08, /* 0x3e6d315f22491900 */ + 1.11957989042397958409e-07, /* 0x3e7e0db1c5bb84b2 */ + 8.47006714134442661218e-07, /* 0x3eac6bbb7644ff76 */ + 8.92831044643427836228e-07, /* 0x3eadf55c3afec01f */ + 7.77828292464916501663e-07, /* 0x3eaa197e81034da3 */ + 6.48469316302918797451e-08, /* 0x3e71683f4920555d */ + 2.12579816658859849140e-07, /* 0x3e8c882fd78bb0b0 */ + 7.61222472580559138435e-07, /* 0x3ea98ad9eb7b83ec */ + 2.86488961857314189607e-07, /* 0x3e9339d7c7777273 */ + 2.14637363790165363515e-07, /* 0x3e8ccee237cae6fe */ + 5.44137005612605847831e-08, /* 0x3e6d368fe324a146 */ + 2.58378284856442408413e-07, /* 0x3e9156e7b6d99b45 */ + 3.15848939061134843091e-07, /* 0x3e95323e5310b5c1 */ + 6.60530466255089632309e-07, /* 0x3ea629e9db362f5d */ + 7.63436345535852301127e-07, /* 0x3ea99dde4728d7ec */ + 8.68233432860324345268e-08, /* 0x3e774e746878544d */ + 9.45465175398023087082e-07, /* 0x3eafb97be873a87d */ + 8.77499534786171267246e-07, /* 0x3ead71a9e23c2f63 */ + 2.74055432394999316135e-07, /* 0x3e92643c89cda173 */ + 4.72129009349126213532e-07, /* 
0x3e9faf1d57a4d56c */
+    8.93777032327078947306e-07,   /* 0x3eadfd7c7ab7b282 */
+    0.00000000000000000000e+00};  /* 0x0000000000000000 */
+
+
+  /* Handle special arguments first */
+
+  GET_BITS_DP64(x, ux);
+  ax = ux & (~SIGNBIT_DP64);
+
+  if (ax >= 0x7ff0000000000000)
+    {
+      /* x is either NaN or infinity */
+      if (ux & MANTBITS_DP64)
+        /* x is NaN */
+        return x + x; /* Raise invalid if it is a signalling NaN */
+      else if (ux & SIGNBIT_DP64)
+        /* x is negative infinity */
+        return nan_with_flags(AMD_F_INVALID);
+      else
+        /* x is positive infinity */
+        return x;
+    }
+  else if (ux & SIGNBIT_DP64)
+    {
+      /* x is negative. */
+      if (ux == SIGNBIT_DP64)
+        /* Handle negative zero first */
+        return x;
+      else
+        return nan_with_flags(AMD_F_INVALID);
+    }
+  else if (ux <= 0x000fffffffffffff)
+    {
+      /* x is denormalised or zero */
+      if (ux == 0)
+        /* x is zero */
+        return x;
+      else
+        {
+          /* x is denormalised; scale it up */
+          /* Normalize x by increasing the exponent by 60
+             and subtracting a correction to account for the implicit
+             bit. This replaces a slow denormalized
+             multiplication by a fast normal subtraction. */
+          static const double corr = 2.5653355008114851558350183e-290; /* 0x03d0000000000000 */
+          denorm = 1;
+          GET_BITS_DP64(x, ux);
+          PUT_BITS_DP64(ux | 0x03d0000000000000, x);
+          x -= corr;
+          GET_BITS_DP64(x, ux);
+        }
+    }
+
+  /* Main algorithm */
+
+  /*
+    Find y and e such that x = 2^e * y, where y in [1,4).
+    This is done using an in-lined variant of splitDouble,
+    which also ensures that e is even.
+  */
+  y = x;
+  ux &= EXPBITS_DP64;
+  ux >>= EXPSHIFTBITS_DP64;
+  if (ux & 1)
+    {
+      GET_BITS_DP64(y, u);
+      u &= (SIGNBIT_DP64 | MANTBITS_DP64);
+      u |= ONEEXPBITS_DP64;
+      PUT_BITS_DP64(u, y);
+      e = ux - EXPBIAS_DP64;
+    }
+  else
+    {
+      GET_BITS_DP64(y, u);
+      u &= (SIGNBIT_DP64 | MANTBITS_DP64);
+      u |= TWOEXPBITS_DP64;
+      PUT_BITS_DP64(u, y);
+      e = ux - EXPBIAS_DP64 - 1;
+    }
+
+
+  /* Find the index of the sub-interval of [1,4) in which y lies. */
+
+  index = (int)(32.0*y+0.5);
+
+  /* Look up the table values and compute c and r = t/c, where t = y - c */
+
+  rtc_lead = rt_jby32_lead_table_dbl[index-32];
+  rtc_trail = rt_jby32_trail_table_dbl[index-32];
+  c = 0.03125*index;
+  r = (y - c)/c;
+
+  /*
+    Find q = sqrt(1+r) - 1.
+    From one step of Newton on (q+1)^2 = 1+r
+  */
+
+  p = r*0.5 - r*r*(0.1250079870 - r*(0.6250522999E-01));
+  twop = p + p;
+  q = p - (p*p + (twop - r))/(twop + 2.0);
+
+  /* Reconstruction */
+
+  rtc = rtc_lead + rtc_trail;
+  e >>= 1; /* e = e/2 */
+  z = rtc_lead + (rtc*q+rtc_trail);
+
+  if (denorm)
+    {
+      /* Scale by 2**(e-30) */
+      PUT_BITS_DP64(((long long)(e - 30) + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, r);
+      z *= r;
+    }
+  else
+    {
+      /* Scale by 2**e */
+      PUT_BITS_DP64(((long long)e + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, r);
+      z *= r;
+    }
+
+  return z;
+
+}
+#endif /* SQRT_AMD_INLINE */
+
+#if defined(USE_SQRTF_AMD_INLINE)
+
+static inline float sqrtf_amd_inline(float x)
+{
+  /*
+    Computes the square root of x.
+
+    The calculation is carried out in three steps.
+
+    Step 1. Reduction.
+    The input argument is scaled to the interval [1, 4) by
+    computing
+           x = 2^e * y, where y in [1,4).
+    Furthermore y is decomposed as y = c + t where
+           c = 1 + j/32, j = 0,1,..,96; and |t| <= 1/64.
+
+    Step 2. Approximation.
+    An approximation q = sqrt(1 + (t/c)) - 1 is obtained
+    from a basic series expansion using precomputed values
+    stored in rt_jby32_lead_table_float and rt_jby32_trail_table_float.
+
+    Step 3. Reconstruction.
+ The value of sqrt(x) is reconstructed via + sqrt(x) = 2^(e/2) * sqrt(y) + = 2^(e/2) * sqrt(c) * sqrt(y/c) + = 2^(e/2) * sqrt(c) * sqrt(1 + t/c) + = 2^(e/2) * [ sqrt(c) + sqrt(c)*q ] + */ + + unsigned int ux, ax, u; + float r1, r2, c, y, p, q, r, twop, z, rtc, rtc_lead, rtc_trail; + int e, denorm = 0, index; + +/* Arrays rt_jby32_lead_table_float and rt_jby32_trail_table_float contain + leading and trailing parts respectively of precomputed + values of sqrt(j/32), for j = 32, 33, ..., 128. + rt_jby32_lead_table_float contains the first 13 bits of precision, + and rt_jby32_trail_table_float contains a further 24 bits precision. */ + +static const float rt_jby32_lead_table_float[97] = { + 1.00000000000000000000e+00F, /* 0x3f800000 */ + 1.01538085937500000000e+00F, /* 0x3f81f800 */ + 1.03076171875000000000e+00F, /* 0x3f83f000 */ + 1.04565429687500000000e+00F, /* 0x3f85d800 */ + 1.06054687500000000000e+00F, /* 0x3f87c000 */ + 1.07519531250000000000e+00F, /* 0x3f89a000 */ + 1.08959960937500000000e+00F, /* 0x3f8b7800 */ + 1.10375976562500000000e+00F, /* 0x3f8d4800 */ + 1.11791992187500000000e+00F, /* 0x3f8f1800 */ + 1.13183593750000000000e+00F, /* 0x3f90e000 */ + 1.14550781250000000000e+00F, /* 0x3f92a000 */ + 1.15917968750000000000e+00F, /* 0x3f946000 */ + 1.17236328125000000000e+00F, /* 0x3f961000 */ + 1.18579101562500000000e+00F, /* 0x3f97c800 */ + 1.19873046875000000000e+00F, /* 0x3f997000 */ + 1.21191406250000000000e+00F, /* 0x3f9b2000 */ + 1.22460937500000000000e+00F, /* 0x3f9cc000 */ + 1.23730468750000000000e+00F, /* 0x3f9e6000 */ + 1.25000000000000000000e+00F, /* 0x3fa00000 */ + 1.26220703125000000000e+00F, /* 0x3fa19000 */ + 1.27465820312500000000e+00F, /* 0x3fa32800 */ + 1.28686523437500000000e+00F, /* 0x3fa4b800 */ + 1.29882812500000000000e+00F, /* 0x3fa64000 */ + 1.31079101562500000000e+00F, /* 0x3fa7c800 */ + 1.32275390625000000000e+00F, /* 0x3fa95000 */ + 1.33447265625000000000e+00F, /* 0x3faad000 */ + 1.34619140625000000000e+00F, /* 0x3fac5000 */ + 1.35766601562500000000e+00F, /* 0x3fadc800 */ + 1.36914062500000000000e+00F, /* 0x3faf4000 */ + 1.38061523437500000000e+00F, /* 0x3fb0b800 */ + 1.39184570312500000000e+00F, /* 0x3fb22800 */ + 1.40307617187500000000e+00F, /* 0x3fb39800 */ + 1.41406250000000000000e+00F, /* 0x3fb50000 */ + 1.42504882812500000000e+00F, /* 0x3fb66800 */ + 1.43603515625000000000e+00F, /* 0x3fb7d000 */ + 1.44677734375000000000e+00F, /* 0x3fb93000 */ + 1.45751953125000000000e+00F, /* 0x3fba9000 */ + 1.46826171875000000000e+00F, /* 0x3fbbf000 */ + 1.47900390625000000000e+00F, /* 0x3fbd5000 */ + 1.48950195312500000000e+00F, /* 0x3fbea800 */ + 1.50000000000000000000e+00F, /* 0x3fc00000 */ + 1.51025390625000000000e+00F, /* 0x3fc15000 */ + 1.52050781250000000000e+00F, /* 0x3fc2a000 */ + 1.53076171875000000000e+00F, /* 0x3fc3f000 */ + 1.54101562500000000000e+00F, /* 0x3fc54000 */ + 1.55102539062500000000e+00F, /* 0x3fc68800 */ + 1.56103515625000000000e+00F, /* 0x3fc7d000 */ + 1.57104492187500000000e+00F, /* 0x3fc91800 */ + 1.58105468750000000000e+00F, /* 0x3fca6000 */ + 1.59082031250000000000e+00F, /* 0x3fcba000 */ + 1.60058593750000000000e+00F, /* 0x3fcce000 */ + 1.61035156250000000000e+00F, /* 0x3fce2000 */ + 1.62011718750000000000e+00F, /* 0x3fcf6000 */ + 1.62963867187500000000e+00F, /* 0x3fd09800 */ + 1.63916015625000000000e+00F, /* 0x3fd1d000 */ + 1.64868164062500000000e+00F, /* 0x3fd30800 */ + 1.65820312500000000000e+00F, /* 0x3fd44000 */ + 1.66748046875000000000e+00F, /* 0x3fd57000 */ + 1.67700195312500000000e+00F, /* 0x3fd6a800 */ + 
1.68627929687500000000e+00F, /* 0x3fd7d800 */ + 1.69555664062500000000e+00F, /* 0x3fd90800 */ + 1.70458984375000000000e+00F, /* 0x3fda3000 */ + 1.71386718750000000000e+00F, /* 0x3fdb6000 */ + 1.72290039062500000000e+00F, /* 0x3fdc8800 */ + 1.73193359375000000000e+00F, /* 0x3fddb000 */ + 1.74096679687500000000e+00F, /* 0x3fded800 */ + 1.75000000000000000000e+00F, /* 0x3fe00000 */ + 1.75878906250000000000e+00F, /* 0x3fe12000 */ + 1.76757812500000000000e+00F, /* 0x3fe24000 */ + 1.77636718750000000000e+00F, /* 0x3fe36000 */ + 1.78515625000000000000e+00F, /* 0x3fe48000 */ + 1.79394531250000000000e+00F, /* 0x3fe5a000 */ + 1.80273437500000000000e+00F, /* 0x3fe6c000 */ + 1.81127929687500000000e+00F, /* 0x3fe7d800 */ + 1.81982421875000000000e+00F, /* 0x3fe8f000 */ + 1.82836914062500000000e+00F, /* 0x3fea0800 */ + 1.83691406250000000000e+00F, /* 0x3feb2000 */ + 1.84545898437500000000e+00F, /* 0x3fec3800 */ + 1.85400390625000000000e+00F, /* 0x3fed5000 */ + 1.86230468750000000000e+00F, /* 0x3fee6000 */ + 1.87060546875000000000e+00F, /* 0x3fef7000 */ + 1.87915039062500000000e+00F, /* 0x3ff08800 */ + 1.88745117187500000000e+00F, /* 0x3ff19800 */ + 1.89550781250000000000e+00F, /* 0x3ff2a000 */ + 1.90380859375000000000e+00F, /* 0x3ff3b000 */ + 1.91210937500000000000e+00F, /* 0x3ff4c000 */ + 1.92016601562500000000e+00F, /* 0x3ff5c800 */ + 1.92822265625000000000e+00F, /* 0x3ff6d000 */ + 1.93627929687500000000e+00F, /* 0x3ff7d800 */ + 1.94433593750000000000e+00F, /* 0x3ff8e000 */ + 1.95239257812500000000e+00F, /* 0x3ff9e800 */ + 1.96044921875000000000e+00F, /* 0x3ffaf000 */ + 1.96826171875000000000e+00F, /* 0x3ffbf000 */ + 1.97631835937500000000e+00F, /* 0x3ffcf800 */ + 1.98413085937500000000e+00F, /* 0x3ffdf800 */ + 1.99194335937500000000e+00F, /* 0x3ffef800 */ + 2.00000000000000000000e+00F}; /* 0x40000000 */ + +static const float rt_jby32_trail_table_float[97] = { + 0.00000000000000000000e+00F, /* 0x00000000 */ + 1.23941208585165441036e-04F, /* 0x3901f637 */ + 1.46876545841223560274e-05F, /* 0x37766aff */ + 1.70736297150142490864e-04F, /* 0x393307ad */ + 1.13296780909877270460e-04F, /* 0x38ed99bf */ + 9.53458802541717886925e-05F, /* 0x38c7f46e */ + 1.25126505736261606216e-04F, /* 0x39033464 */ + 2.10342666832730174065e-04F, /* 0x395c8f6e */ + 1.14066875539720058441e-04F, /* 0x38ef3730 */ + 8.72047676239162683487e-05F, /* 0x38b6e1b4 */ + 1.36111237225122749805e-04F, /* 0x390eb915 */ + 2.26244374061934649944e-05F, /* 0x37bdc99c */ + 2.40658700931817293167e-04F, /* 0x397c5954 */ + 6.31069415248930454254e-05F, /* 0x38845848 */ + 2.27412077947519719601e-04F, /* 0x396e7577 */ + 5.90185391047270968556e-06F, /* 0x36c6088a */ + 1.35496389702893793583e-04F, /* 0x390e1409 */ + 1.32179571664892137051e-04F, /* 0x390a99af */ + 0.00000000000000000000e+00F, /* 0x00000000 */ + 2.31086043640971183777e-04F, /* 0x39724fb0 */ + 9.66752704698592424393e-05F, /* 0x38cabe24 */ + 8.85332483449019491673e-05F, /* 0x38b9aaed */ + 2.09980673389509320259e-04F, /* 0x395c2e42 */ + 2.20044588786549866199e-04F, /* 0x3966bbc5 */ + 1.21749282698146998882e-04F, /* 0x38ff53a6 */ + 1.62125259521417319775e-04F, /* 0x392a002b */ + 9.97955357888713479042e-05F, /* 0x38d14952 */ + 1.81545779923908412457e-04F, /* 0x393e5d53 */ + 1.65768768056295812130e-04F, /* 0x392dd237 */ + 5.48927710042335093021e-05F, /* 0x38663caa */ + 9.53875860432162880898e-05F, /* 0x38c80ad2 */ + 4.53481625299900770187e-05F, /* 0x383e3438 */ + 1.51062369695864617825e-04F, /* 0x391e667f */ + 1.70453247847035527229e-04F, /* 0x3932bbb2 */ + 1.05505387182347476482e-04F, /* 
0x38dd42c6 */ + 2.02269104192964732647e-04F, /* 0x39541833 */ + 2.18442466575652360916e-04F, /* 0x39650db4 */ + 1.55796806211583316326e-04F, /* 0x39235d63 */ + 1.60395247803535312414e-05F, /* 0x37868c9e */ + 4.49578510597348213196e-05F, /* 0x383c9120 */ + 0.00000000000000000000e+00F, /* 0x00000000 */ + 1.26840444863773882389e-04F, /* 0x39050079 */ + 1.82820076588541269302e-04F, /* 0x393fb364 */ + 1.69370483490638434887e-04F, /* 0x3931990b */ + 8.78757418831810355186e-05F, /* 0x38b849ee */ + 1.83815121999941766262e-04F, /* 0x3940be7f */ + 2.14343352126888930798e-04F, /* 0x3960c15b */ + 1.80714370799250900745e-04F, /* 0x393d7e25 */ + 8.41425862745381891727e-05F, /* 0x38b075b5 */ + 1.69945167726837098598e-04F, /* 0x3932334f */ + 1.95121858268976211548e-04F, /* 0x394c99a0 */ + 1.60778334247879683971e-04F, /* 0x3928969b */ + 6.79871009197086095810e-05F, /* 0x388e944c */ + 1.61929419846273958683e-04F, /* 0x3929cb99 */ + 1.99474830878898501396e-04F, /* 0x39512a1e */ + 1.81604162207804620266e-04F, /* 0x393e6cff */ + 1.09270178654696792364e-04F, /* 0x38e527fb */ + 2.27539261686615645885e-04F, /* 0x396e979b */ + 4.90300008095800876617e-05F, /* 0x384da590 */ + 6.28985289949923753738e-05F, /* 0x3883e864 */ + 2.58551553997676819563e-05F, /* 0x37d8e386 */ + 1.82868374395184218884e-04F, /* 0x393fc05b */ + 4.64625991298817098141e-05F, /* 0x3842e0d6 */ + 1.05703387816902250051e-04F, /* 0x38ddad13 */ + 1.17213814519345760345e-04F, /* 0x38f5d0b0 */ + 8.17377731436863541603e-05F, /* 0x38ab6aa2 */ + 0.00000000000000000000e+00F, /* 0x00000000 */ + 1.16847433673683553934e-04F, /* 0x38f50bfd */ + 1.88827965757809579372e-04F, /* 0x3946001f */ + 2.16612941585481166840e-04F, /* 0x39632298 */ + 2.00857131858356297016e-04F, /* 0x39529d2d */ + 1.42199307447299361229e-04F, /* 0x39151b56 */ + 4.12627305195201188326e-05F, /* 0x382d1185 */ + 1.42796401632949709892e-04F, /* 0x3915bb9e */ + 2.03253570361994206905e-04F, /* 0x39552077 */ + 2.23214170546270906925e-04F, /* 0x396a0e99 */ + 2.03244591830298304558e-04F, /* 0x39551e0e */ + 1.43898156238719820976e-04F, /* 0x3916e35e */ + 4.57155256299301981926e-05F, /* 0x383fbeac */ + 1.53365719597786664963e-04F, /* 0x3920d0cc */ + 2.23224633373320102692e-04F, /* 0x396a1168 */ + 1.16566716314991936088e-05F, /* 0x37439106 */ + 7.43694272387074306607e-06F, /* 0x36f98ada */ + 2.11048507480882108212e-04F, /* 0x395d4ce7 */ + 1.34682719362899661064e-04F, /* 0x390d399e */ + 2.29425968427676707506e-05F, /* 0x37c074da */ + 1.20421340398024767637e-04F, /* 0x38fc8ab7 */ + 1.83421318070031702518e-04F, /* 0x394054c9 */ + 2.12376224226318299770e-04F, /* 0x395eb14f */ + 2.07710763788782060146e-04F, /* 0x3959ccef */ + 1.69840845046564936638e-04F, /* 0x3932174e */ + 9.91739216260612010956e-05F, /* 0x38cffb98 */ + 2.40249748458154499531e-04F, /* 0x397beb8d */ + 1.05178231024183332920e-04F, /* 0x38dc9322 */ + 1.82623916771262884140e-04F, /* 0x393f7ebc */ + 2.28821940254420042038e-04F, /* 0x396fefec */ + 0.00000000000000000000e+00F}; /* 0x00000000 */ + + +/* Handle special arguments first */ + + GET_BITS_SP32(x, ux); + ax = ux & (~SIGNBIT_SP32); + + if(ax >= 0x7f800000) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_SP32) + /* x is NaN */ + return x + x; /* Raise invalid if it is a signalling NaN */ + else if (ux & SIGNBIT_SP32) + return nanf_with_flags(AMD_F_INVALID); + else + /* x is positive infinity */ + return x; + } + else if (ux & SIGNBIT_SP32) + { + /* x is negative. 
*/
+      if (x == 0.0F)
+        /* Handle negative zero first */
+        return x;
+      else
+        return nanf_with_flags(AMD_F_INVALID);
+    }
+  else if (ux <= 0x007fffff)
+    {
+      /* x is denormalised or zero */
+      if (ux == 0)
+        /* x is zero */
+        return x;
+      else
+        {
+          /* x is denormalised; scale it up */
+          /* Normalize x by increasing the exponent by 26
+             and subtracting a correction to account for the implicit
+             bit. This replaces a slow denormalized
+             multiplication by a fast normal subtraction. */
+          static const float corr = 7.888609052210118054e-31F; /* 0x0d800000 */
+          denorm = 1;
+          GET_BITS_SP32(x, ux);
+          PUT_BITS_SP32(ux | 0x0d800000, x);
+          x -= corr;
+          GET_BITS_SP32(x, ux);
+        }
+    }
+
+  /* Main algorithm */
+
+  /*
+    Find y and e such that x = 2^e * y, where y in [1,4).
+    This is done using an in-lined variant of splitFloat,
+    which also ensures that e is even.
+  */
+  y = x;
+  ux &= EXPBITS_SP32;
+  ux >>= EXPSHIFTBITS_SP32;
+  if (ux & 1)
+    {
+      GET_BITS_SP32(y, u);
+      u &= (SIGNBIT_SP32 | MANTBITS_SP32);
+      u |= ONEEXPBITS_SP32;
+      PUT_BITS_SP32(u, y);
+      e = ux - EXPBIAS_SP32;
+    }
+  else
+    {
+      GET_BITS_SP32(y, u);
+      u &= (SIGNBIT_SP32 | MANTBITS_SP32);
+      u |= TWOEXPBITS_SP32;
+      PUT_BITS_SP32(u, y);
+      e = ux - EXPBIAS_SP32 - 1;
+    }
+
+  /* Find the index of the sub-interval of [1,4) in which y lies. */
+
+  index = (int)(32.0F*y + 0.5F);
+
+  /* Look up the table values and compute c and r = t/c, where t = y - c */
+
+  rtc_lead = rt_jby32_lead_table_float[index-32];
+  rtc_trail = rt_jby32_trail_table_float[index-32];
+  c = 0.03125F*index;
+  r = (y - c)/c;
+
+  /*
+    Find q = sqrt(1+r) - 1.
+    From one step of Newton on (q+1)^2 = 1+r
+  */
+
+  p = r*0.5F - r*r*(0.1250079870F - r*(0.6250522999e-01F));
+  twop = p + p;
+  q = p - (p*p + (twop - r))/(twop + 2.0F);
+
+  /* Reconstruction */
+
+  rtc = rtc_lead + rtc_trail;
+  e >>= 1; /* e = e/2 */
+  z = rtc_lead + (rtc*q + rtc_trail);
+
+  if (denorm)
+    {
+      /* Scale by 2**(e-13) */
+      PUT_BITS_SP32(((e - 13) + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, r);
+      z *= r;
+    }
+  else
+    {
+      /* Scale by 2**e */
+      PUT_BITS_SP32((e + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, r);
+      z *= r;
+    }
+
+  return z;
+
+}
+#endif /* SQRTF_AMD_INLINE */
+
+#ifdef USE_LOG_KERNEL_AMD
+static inline void log_kernel_amd64(double x, unsigned long long ux, int *xexp, double *r1, double *r2)
+{
+
+  int expadjust;
+  double r, z1, z2, correction, f, f1, f2, q, u, v, poly;
+  int index;
+
+  /*
+    Computes natural log(x). Algorithm based on:
+    Ping-Tak Peter Tang
+    "Table-driven implementation of the logarithm function in IEEE
+    floating-point arithmetic"
+    ACM Transactions on Mathematical Software (TOMS)
+    Volume 16, Issue 4 (December 1990)
+  */
+
+/* Arrays ln_lead_table and ln_tail_table contain
+   leading and trailing parts respectively of precomputed
+   values of natural log(1+i/64), for i = 0, 1, ..., 64.
+   ln_lead_table contains the first 24 bits of precision,
+   and ln_tail_table contains a further 53 bits precision.
*/ + + static const double ln_lead_table[65] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */ + 3.07716131210327148438e-02, /* 0x3f9f829800000000 */ + 4.58095073699951171875e-02, /* 0x3fa7745800000000 */ + 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */ + 7.52233862876892089844e-02, /* 0x3fb341d700000000 */ + 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */ + 1.03796780109405517578e-01, /* 0x3fba926d00000000 */ + 1.17783010005950927734e-01, /* 0x3fbe270700000000 */ + 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */ + 1.45181953907012939453e-01, /* 0x3fc2955280000000 */ + 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */ + 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */ + 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */ + 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */ + 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */ + 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */ + 2.35566020011901855469e-01, /* 0x3fce270700000000 */ + 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */ + 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */ + 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */ + 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */ + 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */ + 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */ + 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */ + 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */ + 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */ + 3.51976394653320312500e-01, /* 0x3fd686c800000000 */ + 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */ + 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */ + 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */ + 3.94993782043457031250e-01, /* 0x3fd9479400000000 */ + 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */ + 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */ + 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */ + 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */ + 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */ + 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */ + 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */ + 4.75845873355865478516e-01, /* 0x3fde744240000000 */ + 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */ + 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */ + 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */ + 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */ + 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */ + 5.32464742660522460938e-01, /* 0x3fe109f380000000 */ + 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */ + 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */ + 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */ + 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */ + 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */ + 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */ + 5.94707071781158447266e-01, /* 0x3fe307d720000000 */ + 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */ + 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */ + 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */ + 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */ + 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */ + 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */ + 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */ + 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */ + 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */ + 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */ + 6.85303986072540283203e-01, /* 
0x3fe5ee02a0000000 */ + 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */ + + static const double ln_tail_table[65] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */ + 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */ + 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */ + 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */ + 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */ + 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */ + 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */ + 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */ + 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */ + 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */ + 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */ + 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */ + 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */ + 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */ + 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */ + 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */ + 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */ + 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */ + 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */ + 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */ + 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */ + 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */ + 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */ + 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */ + 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */ + 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */ + 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */ + 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */ + 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */ + 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */ + 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */ + 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */ + 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */ + 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */ + 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */ + 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */ + 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */ + 4.43021445893361960146e-09, /* 0x3e33071282fb989b */ + 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */ + 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */ + 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */ + 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */ + 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */ + 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */ + 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */ + 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */ + 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */ + 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */ + 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */ + 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */ + 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */ + 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */ + 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */ + 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */ + 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */ + 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */ + 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */ + 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */ + 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */ + 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */ + 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */ + 
2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */ + 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */ + 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */ + + /* Approximating polynomial coefficients for x near 1.0 */ + static const double + ca_1 = 8.33333333333317923934e-02, /* 0x3fb55555555554e6 */ + ca_2 = 1.25000000037717509602e-02, /* 0x3f89999999bac6d4 */ + ca_3 = 2.23213998791944806202e-03, /* 0x3f62492307f1519f */ + ca_4 = 4.34887777707614552256e-04; /* 0x3f3c8034c85dfff0 */ + + /* Approximating polynomial coefficients for other x */ + static const double + cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */ + cb_2 = 1.24999999978138668903e-02, /* 0x3f89999999865ede */ + cb_3 = 2.23219810758559851206e-03; /* 0x3f6249423bd94741 */ + + static const unsigned long long + log_thresh1 = 0x3fee0faa00000000, + log_thresh2 = 0x3ff1082c00000000; + + /* log_thresh1 = 9.39412117004394531250e-1 = 0x3fee0faa00000000 + log_thresh2 = 1.06449508666992187500 = 0x3ff1082c00000000 */ + if (ux >= log_thresh1 && ux <= log_thresh2) + { + /* Arguments close to 1.0 are handled separately to maintain + accuracy. + + The approximation in this region exploits the identity + log( 1 + r ) = log( 1 + u/2 ) / log( 1 - u/2 ), where + u = 2r / (2+r). + Note that the right hand side has an odd Taylor series expansion + which converges much faster than the Taylor series expansion of + log( 1 + r ) in r. Thus, we approximate log( 1 + r ) by + u + A1 * u^3 + A2 * u^5 + ... + An * u^(2n+1). + + One subtlety is that since u cannot be calculated from + r exactly, the rounding error in the first u should be + avoided if possible. To accomplish this, we observe that + u = r - r*r/(2+r). + Since x (=1+r) is the input argument, and thus presumed exact, + the formula above approximates u accurately because + u = r - correction, + and the magnitude of "correction" (of the order of r*r) + is small. + With these observations, we will approximate log( 1 + r ) by + r + ( (A1*u^3 + ... + An*u^(2n+1)) - correction ). + + We approximate log(1+r) by an odd polynomial in u, where + u = 2r/(2+r) = r - r*r/(2+r). + */ + r = x - 1.0; + u = r / (2.0 + r); + correction = r * u; + u = u + u; + v = u * u; + z1 = r; + z2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + *r1 = z1; + *r2 = z2; + *xexp = 0; + } + else + { + /* + First, we decompose the argument x to the form + x = 2**M * (F1 + F2), + where 1 <= F1+F2 < 2, M has the value of an integer, + F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128. + + Second, we approximate log( 1 + F2/F1 ) by an odd polynomial + in U, where U = 2 F2 / (2 F2 + F1). + Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ). + The core approximation calculates + Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1. + Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ), + thus, Poly = 2 arctanh( U/2 ) / U - 1. + + It is not hard to see that + log(x) = M*log(2) + log(F1) + log( 1 + F2/F1 ). + Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1). + The values of log(F1) are calculated beforehand and stored + in the program. + */ + + f = x; + if (ux < IMPBIT_DP64) + { + /* The input argument x is denormalized */ + /* Normalize f by increasing the exponent by 60 + and subtracting a correction to account for the implicit + bit. This replaces a slow denormalized + multiplication by a fast normal subtraction. 
*/ + static const double corr = 2.5653355008114851558350183e-290; /* 0x03d0000000000000 */ + GET_BITS_DP64(f, ux); + ux |= 0x03d0000000000000; + PUT_BITS_DP64(ux, f); + f -= corr; + GET_BITS_DP64(f, ux); + expadjust = 60; + } + else + expadjust = 0; + + /* Store the exponent of x in xexp and put + f into the range [0.5,1) */ + *xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64 - expadjust; + PUT_BITS_DP64((ux & MANTBITS_DP64) | HALFEXPBITS_DP64, f); + + /* Now x = 2**xexp * f, 1/2 <= f < 1. */ + + /* Set index to be the nearest integer to 128*f */ + r = 128.0 * f; + index = (int)(r + 0.5); + + z1 = ln_lead_table[index-64]; + q = ln_tail_table[index-64]; + f1 = index * 0.0078125; /* 0.0078125 = 1/128 */ + f2 = f - f1; + /* At this point, x = 2**xexp * ( f1 + f2 ) where + f1 = j/128, j = 64, 65, ..., 128 and |f2| <= 1/256. */ + + /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */ + /* u = f2 / (f1 + 0.5 * f2); */ + u = f2 / (f1 + 0.5 * f2); + + /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1). + The core approximation calculates + poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */ + v = u * u; + poly = (v * (cb_1 + v * (cb_2 + v * cb_3))); + z2 = q + (u + u * poly); + *r1 = z1; + *r2 = z2; + } + return; +} +#endif /* USE_LOG_KERNEL_AMD */ + +#if defined(USE_REMAINDER_PIBY2F_INLINE) +/* Define this to get debugging print statements activated */ +#define DEBUGGING_PRINT +#undef DEBUGGING_PRINT + + +#ifdef DEBUGGING_PRINT +#include <stdio.h> +char *d2b(long long d, int bitsper, int point) +{ + static char buff[200]; + int i, j; + j = bitsper; + if (point >= 0 && point <= bitsper) + j++; + buff[j] = '\0'; + for (i = bitsper - 1; i >= 0; i--) + { + j--; + if (d % 2 == 1) + buff[j] = '1'; + else + buff[j] = '0'; + if (i == point) + { + j--; + buff[j] = '.'; + } + d /= 2; + } + return buff; +} +#endif + +/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using + extra precision, and return the result in r. + Return value "region" tells how many lots of pi/2 were subtracted + from x to put it in the range [-pi/4,pi/4], mod 4. 
*/ +static inline void __remainder_piby2f_inline(unsigned long long ux, double *r, int *region) +{ + + /* This method simulates multi-precision floating-point + arithmetic and is accurate for all 1 <= x < infinity */ +#define bitsper 36 + unsigned long long res[10]; + unsigned long long u, carry, mask, mant, nextbits; + int first, last, i, rexp, xexp, resexp, ltb, determ, bc; + double dx; + static const double + piby2 = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */ +#ifdef WINDOWS + static unsigned long long pibits[] = + { + 0LL, + 5215LL, 13000023176LL, 11362338026LL, 67174558139LL, + 34819822259LL, 10612056195LL, 67816420731LL, 57840157550LL, + 19558516809LL, 50025467026LL, 25186875954LL, 18152700886LL + }; +#else + static unsigned long long pibits[] = + { + 0L, + 5215L, 13000023176L, 11362338026L, 67174558139L, + 34819822259L, 10612056195L, 67816420731L, 57840157550L, + 19558516809L, 50025467026L, 25186875954L, 18152700886L + }; +#endif + +#ifdef DEBUGGING_PRINT + printf("On entry, x = %25.20e = %s\n", x, double2hex(&x)); +#endif + + xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64); + ux = ((ux & MANTBITS_DP64) | IMPBIT_DP64) >> 29; + +#ifdef DEBUGGING_PRINT + printf("ux = %s\n", d2b(ux, 64, -1)); +#endif + + /* Now ux is the mantissa bit pattern of x as a long integer */ + mask = 1; + mask = (mask << bitsper) - 1; + + /* Set first and last to the positions of the first + and last chunks of 2/pi that we need */ + first = xexp / bitsper; + resexp = xexp - first * bitsper; + /* 120 is the theoretical maximum number of bits (actually + 115 for IEEE single precision) that we need to extract + from the middle of 2/pi to compute the reduced argument + accurately enough for our purposes */ + last = first + 120 / bitsper; + +#ifdef DEBUGGING_PRINT + printf("first = %d, last = %d\n", first, last); +#endif + + /* Do a long multiplication of the bits of 2/pi by the + integer mantissa */ + /* Unroll the loop. This is only correct because we know + that bitsper is fixed as 36. */ + res[4] = 0; + u = pibits[last] * ux; + res[3] = u & mask; + carry = u >> bitsper; + u = pibits[last - 1] * ux + carry; + res[2] = u & mask; + carry = u >> bitsper; + u = pibits[last - 2] * ux + carry; + res[1] = u & mask; + carry = u >> bitsper; + u = pibits[first] * ux + carry; + res[0] = u & mask; + +#ifdef DEBUGGING_PRINT + printf("resexp = %d\n", resexp); + printf("Significant part of x * 2/pi with binary" + " point in correct place:\n"); + for (i = 0; i <= last - first; i++) + { + if (i > 0 && i % 5 == 0) + printf("\n "); + if (i == 1) + printf("%s ", d2b(res[i], bitsper, resexp)); + else + printf("%s ", d2b(res[i], bitsper, -1)); + } + printf("\n"); +#endif + + /* Reconstruct the result */ + ltb = (int)((((res[0] << bitsper) | res[1]) + >> (bitsper - 1 - resexp)) & 7); + + /* determ says whether the fractional part is >= 0.5 */ + determ = ltb & 1; + +#ifdef DEBUGGING_PRINT + printf("ltb = %d (last two bits before binary point" + " and first bit after)\n", ltb); + printf("determ = %d (1 means need to negate because the fractional\n" + " part of x * 2/pi is greater than 0.5)\n", determ); +#endif + + i = 1; + if (determ) + { + /* The mantissa is >= 0.5. 
We want to subtract it + from 1.0 by negating all the bits */ + *region = ((ltb >> 1) + 1) & 3; + mant = 1; + mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1); + while (mant < 0x0000000000010000) + { + i++; + mant = (mant << bitsper) | (~(res[i]) & mask); + } + nextbits = (~(res[i+1]) & mask); + } + else + { + *region = (ltb >> 1); + mant = 1; + mant = res[1] & ((mant << (bitsper - resexp)) - 1); + while (mant < 0x0000000000010000) + { + i++; + mant = (mant << bitsper) | res[i]; + } + nextbits = res[i+1]; + } + +#ifdef DEBUGGING_PRINT + printf("First bits of mant = %s\n", d2b(mant, bitsper, -1)); +#endif + + /* Normalize the mantissa. The shift value 6 here, determined by + trial and error, seems to give optimal speed. */ + bc = 0; + while (mant < 0x0000400000000000LL) + { + bc += 6; + mant <<= 6; + } + while (mant < 0x0010000000000000LL) + { + bc++; + mant <<= 1; + } + mant |= nextbits >> (bitsper - bc); + + rexp = 52 + resexp - bc - i * bitsper; + +#ifdef DEBUGGING_PRINT + printf("Normalised mantissa = 0x%016lx\n", mant); + printf("Exponent to be inserted on mantissa = rexp = %d\n", rexp); +#endif + + /* Put the result exponent rexp onto the mantissa pattern */ + u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64; + ux = (mant & MANTBITS_DP64) | u; + if (determ) + /* If we negated the mantissa we negate x too */ + ux |= SIGNBIT_DP64; + PUT_BITS_DP64(ux, dx); + +#ifdef DEBUGGING_PRINT + printf("(x*2/pi) = %25.20e = %s\n", dx, double2hex(&dx)); +#endif + + /* x is a double precision version of the fractional part of + x * 2 / pi. Multiply x by pi/2 in double precision + to get the reduced argument r. */ + *r = dx * piby2; + +#ifdef DEBUGGING_PRINT + printf(" r = frac(x*2/pi) * pi/2:\n"); + printf(" r = %25.20e = %s\n", *r, double2hex(r)); + printf("region = (number of pi/2 subtracted from x) mod 4 = %d\n", + *region); +#endif +} +#endif /* USE_REMAINDER_PIBY2F_INLINE */ + +#if defined(WINDOWS) +#if defined(USE_HANDLE_ERROR) || defined(USE_HANDLE_ERRORF) +#include <errno.h> +#endif + +#if defined(USE_HANDLE_ERROR) +/* Define the Microsoft specific error handling routines */ +static __declspec(noinline) double handle_error(const char *name, + unsigned long long value, + int type, int flags, int error, + double arg1, double arg2) +{ + double z; + struct _exception exception_data; + exception_data.type = type; + exception_data.name = (char*)name; + exception_data.arg1 = arg1; + exception_data.arg2 = arg2; + PUT_BITS_DP64(value, z); + exception_data.retval = z; + raise_fpsw_flags(flags); + if (!_matherr(&exception_data)) + { + errno = error; + } + return exception_data.retval; +} +#endif /* USE_HANDLE_ERROR */ + +#if defined(USE_HANDLE_ERRORF) +static __declspec(noinline) float handle_errorf(const char *name, + unsigned int value, + int type, int flags, int error, + float arg1, float arg2) +{ + float z; + struct _exception exception_data; + exception_data.type = type; + exception_data.name = (char*)name; + exception_data.arg1 = (double)arg1; + exception_data.arg2 = (double)arg2; + PUT_BITS_SP32(value, z); + exception_data.retval = z; + raise_fpsw_flags(flags); + if (!_matherr(&exception_data)) + { + errno = error; + } + return (float)exception_data.retval; +} +#endif /* USE_HANDLE_ERRORF */ +#endif /* WINDOWS */ + +#endif /* LIBM_INLINES_AMD_H_INCLUDED */
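A sketch of how the reduced argument and region produced by __remainder_piby2f_inline are typically consumed: region mod 4 selects one of four quarter-wave identities. The kernels sin_piby4 and cos_piby4 below are hypothetical stand-ins for the library's polynomial cores, which are not part of this header.

/* Illustrative sketch only.  sin_piby4 and cos_piby4 are hypothetical
   stand-ins for the library's polynomial kernels on [-pi/4,pi/4]. */
extern double sin_piby4(double r);
extern double cos_piby4(double r);

/* ux is the IEEE bit pattern of the (positive) double argument. */
static float sinf_sketch(unsigned long long ux)
{
    double r;
    int region;
    __remainder_piby2f_inline(ux, &r, &region);
    switch (region & 3)
    {
      case 0:  return (float)sin_piby4(r);   /* sin(x) =  sin(r) */
      case 1:  return (float)cos_piby4(r);   /* sin(x) =  cos(r) */
      case 2:  return (float)-sin_piby4(r);  /* sin(x) = -sin(r) */
      default: return (float)-cos_piby4(r);  /* sin(x) = -cos(r) */
    }
}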
diff --git a/inc/libm_special.h b/inc/libm_special.h new file mode 100644 index 0000000..0833b7b --- /dev/null +++ b/inc/libm_special.h
@@ -0,0 +1,84 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifndef __LIBM_SPECIAL_H__ +#define __LIBM_SPECIAL_H__ + +// exception status set +#define MXCSR_ES_INEXACT 0x00000020 +#define MXCSR_ES_UNDERFLOW 0x00000010 +#define MXCSR_ES_OVERFLOW 0x00000008 +#define MXCSR_ES_DIVBYZERO 0x00000004 +#define MXCSR_ES_INVALID 0x00000001 + +void __amd_handle_errorf(int type, int error, const char *name, + float arg1, unsigned int arg1_is_snan, + float arg2, unsigned int arg2_is_snan, + float retval, unsigned int retval_is_snan); + +void __amd_handle_error(int type, int error, const char *name, + double arg1, + double arg2, + double retval); + +/* Code from GRTE/v4 math.h */ +/* Types of exceptions in the `type' field. */ +#ifndef DOMAIN +struct exception + { + int type; + char *name; + double arg1; + double arg2; + double retval; + }; + +extern int matherr (struct exception *__exc); + +# define X_TLOSS 1.41484755040568800000e+16 + +/* Types of exceptions in the `type' field. */ +# define DOMAIN 1 +# define SING 2 +# define OVERFLOW 3 +# define UNDERFLOW 4 +# define TLOSS 5 +# define PLOSS 6 + +/* SVID mode specifies returning this large value instead of infinity. */ +# define HUGE 3.40282347e+38F + +/* Use this define to enable a (dummy) definition of matherr(). */ +#define NEED_FAKE_MATHERR + +#else /* !SVID */ + +# ifdef __USE_XOPEN +/* X/Open wants another strange constant. */ +# define MAXFLOAT 3.40282347e+38F +# endif + +#endif /* DOMAIN */ +/* Code from GRTE/v4 math.h */ + +#endif // __LIBM_SPECIAL_H__
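The struct exception and the DOMAIN/SING/... type codes above exist so that SVID-mode clients can intercept errors raised by the sources that follow. A minimal sketch of a client-supplied matherr() hook, assuming this header's definitions are in effect and the include path shown:

#include <stdio.h>
#include "third_party/open64_libacml_mv/inc/libm_special.h"  /* path assumed */

/* Returning nonzero tells the library the error was handled, which
   suppresses the default message and the errno assignment. */
int matherr(struct exception *exc)
{
    if (exc->type == DOMAIN)
    {
        fprintf(stderr, "%s: domain error, arg1 = %g\n",
                exc->name, exc->arg1);
        exc->retval = 0.0;  /* substitute our own return value */
        return 1;
    }
    return 0;               /* fall back to the default handling */
}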
diff --git a/inc/libm_util_amd.h b/inc/libm_util_amd.h new file mode 100644 index 0000000..f7347d0 --- /dev/null +++ b/inc/libm_util_amd.h
@@ -0,0 +1,195 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifndef LIBM_UTIL_AMD_H_INCLUDED +#define LIBM_UTIL_AMD_H_INCLUDED 1 + + + + + + +typedef float F32; +typedef unsigned int U32; +typedef int S32; + +typedef double F64; +typedef unsigned long long U64; +typedef long long S64; + +union UT32_ +{ + F32 f32; + U32 u32; +}; + +union UT64_ +{ + F64 f64; + U64 u64; + + F32 f32[2]; + U32 u32[2]; +}; + +typedef union UT32_ UT32; +typedef union UT64_ UT64; + + + + +#define QNAN_MASK_32 0x00400000 +#define QNAN_MASK_64 0x0008000000000000 + + +#define MULTIPLIER_SP 24 +#define MULTIPLIER_DP 53 + +static const double VAL_2PMULTIPLIER_DP = 9007199254740992.0; +static const double VAL_2PMMULTIPLIER_DP = 1.1102230246251565404236316680908e-16; +static const float VAL_2PMULTIPLIER_SP = 16777216.0F; +static const float VAL_2PMMULTIPLIER_SP = 5.9604645e-8F; + + + + + +/* Definitions for double functions on 64 bit machines */ +#define SIGNBIT_DP64 0x8000000000000000 +#define EXPBITS_DP64 0x7ff0000000000000 +#define MANTBITS_DP64 0x000fffffffffffff +#define ONEEXPBITS_DP64 0x3ff0000000000000 +#define TWOEXPBITS_DP64 0x4000000000000000 +#define HALFEXPBITS_DP64 0x3fe0000000000000 +#define IMPBIT_DP64 0x0010000000000000 +#define QNANBITPATT_DP64 0x7ff8000000000000 +#define INDEFBITPATT_DP64 0xfff8000000000000 +#define PINFBITPATT_DP64 0x7ff0000000000000 +#define NINFBITPATT_DP64 0xfff0000000000000 +#define EXPBIAS_DP64 1023 +#define EXPSHIFTBITS_DP64 52 +#define BIASEDEMIN_DP64 1 +#define EMIN_DP64 -1022 +#define BIASEDEMAX_DP64 2046 +#define EMAX_DP64 1023 +#define LAMBDA_DP64 1.0e300 +#define MANTLENGTH_DP64 53 +#define BASEDIGITS_DP64 15 + + +/* These definitions, used by float functions, + are for both 32 and 64 bit machines */ +#define SIGNBIT_SP32 0x80000000 +#define EXPBITS_SP32 0x7f800000 +#define MANTBITS_SP32 0x007fffff +#define ONEEXPBITS_SP32 0x3f800000 +#define TWOEXPBITS_SP32 0x40000000 +#define HALFEXPBITS_SP32 0x3f000000 +#define IMPBIT_SP32 0x00800000 +#define QNANBITPATT_SP32 0x7fc00000 +#define INDEFBITPATT_SP32 0xffc00000 +#define PINFBITPATT_SP32 0x7f800000 +#define NINFBITPATT_SP32 0xff800000 +#define EXPBIAS_SP32 127 +#define EXPSHIFTBITS_SP32 23 +#define BIASEDEMIN_SP32 1 +#define EMIN_SP32 -126 +#define BIASEDEMAX_SP32 254 +#define EMAX_SP32 127 +#define LAMBDA_SP32 1.0e30 +#define MANTLENGTH_SP32 24 +#define BASEDIGITS_SP32 7 + +#define CLASS_SIGNALLING_NAN 1 +#define CLASS_QUIET_NAN 2 +#define CLASS_NEGATIVE_INFINITY 3 +#define CLASS_NEGATIVE_NORMAL_NONZERO 4 +#define CLASS_NEGATIVE_DENORMAL 5 +#define CLASS_NEGATIVE_ZERO 6 +#define CLASS_POSITIVE_ZERO 7 +#define CLASS_POSITIVE_DENORMAL 8 +#define CLASS_POSITIVE_NORMAL_NONZERO 9 +#define CLASS_POSITIVE_INFINITY 10 + +#define OLD_BITS_SP32(x) (*((unsigned int *)&x)) +#define 
OLD_BITS_DP64(x) (*((unsigned long long *)&x))
+
+/* Alternatives to the above macros which avoid strict-aliasing
+   problems at high optimization levels on gcc */
+#define GET_BITS_SP32(x, ux) \
+  { \
+    volatile union {float f; unsigned int i;} _bitsy; \
+    _bitsy.f = (x); \
+    ux = _bitsy.i; \
+  }
+#define PUT_BITS_SP32(ux, x) \
+  { \
+    volatile union {float f; unsigned int i;} _bitsy; \
+    _bitsy.i = (ux); \
+    x = _bitsy.f; \
+  }
+
+#define GET_BITS_DP64(x, ux) \
+  { \
+    volatile union {double d; unsigned long long i;} _bitsy; \
+    _bitsy.d = (x); \
+    ux = _bitsy.i; \
+  }
+#define PUT_BITS_DP64(ux, x) \
+  { \
+    volatile union {double d; unsigned long long i;} _bitsy; \
+    _bitsy.i = (ux); \
+    x = _bitsy.d; \
+  }
+
+
+/* Processor-dependent floating-point status flags */
+#define AMD_F_INEXACT 0x00000020
+#define AMD_F_UNDERFLOW 0x00000010
+#define AMD_F_OVERFLOW 0x00000008
+#define AMD_F_DIVBYZERO 0x00000004
+#define AMD_F_INVALID 0x00000001
+
+/* Processor-dependent floating-point precision-control flags */
+#define AMD_F_EXTENDED 0x00000300
+#define AMD_F_DOUBLE 0x00000200
+#define AMD_F_SINGLE 0x00000000
+
+/* Processor-dependent floating-point rounding-control flags */
+#define AMD_F_RC_NEAREST 0x00000000
+#define AMD_F_RC_DOWN 0x00002000
+#define AMD_F_RC_UP 0x00004000
+#define AMD_F_RC_ZERO 0x00006000
+
+/* How to get hold of an assembly square root instruction:
+ * ASMSQRT(x,y) computes y = sqrt(x).
+ */
+#ifdef WINDOWS
+/* VC++ intrinsic call */
+#define ASMSQRT(x,y) _mm_store_sd(&y, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&x)));
+#else
+/* Hammer sqrt instruction */
+#define ASMSQRT(x,y) asm volatile ("sqrtsd %1, %0" : "=x" (y) : "x" (x));
+#endif
+
+#endif /* LIBM_UTIL_AMD_H_INCLUDED */
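The GET_BITS/PUT_BITS macros above are the basic tool used throughout the sources that follow. A small illustrative decomposition, assuming the header is reachable at the path shown:

#include <stdio.h>
#include "third_party/open64_libacml_mv/inc/libm_util_amd.h"  /* path assumed */

static void show_fields(double x)
{
    unsigned long long ux;
    int xexp;
    GET_BITS_DP64(x, ux);
    /* Re-bias the exponent field to recover the true binary exponent */
    xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
    printf("%g: bits 0x%016llx, exponent %d, mantissa 0x%013llx\n",
           x, ux, xexp, ux & MANTBITS_DP64);
}

/* show_fields(1.5) prints:
   1.5: bits 0x3ff8000000000000, exponent 0, mantissa 0x8000000000000 */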
diff --git a/libacml.h b/libacml.h new file mode 100644 index 0000000..92c2ccb --- /dev/null +++ b/libacml.h
@@ -0,0 +1,76 @@
+// Copyright 2010 and onwards Google Inc.
+// Author: Martin Thuresson
+//
+// Expose the fast k8 implementation of math functions with the prefix
+// "acml_". Currently acml_log(), acml_exp(), and acml_pow() have been
+// shown to have significantly better performance than glibc libm
+// and at least as good precision.
+// https://wiki.corp.google.com/twiki/bin/view/Main/CompilerMathOptimization
+//
+// When built with --cpu=piii, the acml_* wrappers call the plain libm
+// functions, avoiding the need to special-case the calls.
+//
+// TODO(martint): Update glibc to match the libacml performance.
+
+#ifndef THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
+#define THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
+
+#ifndef USE_LIBACML_IMPLEMENTATION
+#define USE_LIBACML_IMPLEMENTATION defined(__x86_64__)
+#endif
+
+#if USE_LIBACML_IMPLEMENTATION
+#include "third_party/open64_libacml_mv/inc/fn_macros.h"
+#else
+#include <math.h>
+#endif
+
+extern "C" {
+
+#if USE_LIBACML_IMPLEMENTATION
+// The k8 implementation of the math functions.
+#define acml_exp_k8 FN_PROTOTYPE(exp)
+#define acml_expf_k8 FN_PROTOTYPE(expf)
+#define acml_log_k8 FN_PROTOTYPE(log)
+#define acml_pow_k8 FN_PROTOTYPE(pow)
+double acml_exp_k8(double x);
+float acml_expf_k8(float x);
+double acml_log_k8(double x);
+double acml_pow_k8(double x, double y);
+#endif
+
+static inline double acml_exp(double x) {
+#if USE_LIBACML_IMPLEMENTATION
+  return acml_exp_k8(x);
+#else
+  return exp(x);
+#endif
+}
+
+static inline float acml_expf(float x) {
+#if USE_LIBACML_IMPLEMENTATION
+  return acml_expf_k8(x);
+#else
+  return expf(x);
+#endif
+}
+
+static inline double acml_log(double x) {
+#if USE_LIBACML_IMPLEMENTATION
+  return acml_log_k8(x);
+#else
+  return log(x);
+#endif
+}
+
+static inline double acml_pow(double x, double y) {
+#if USE_LIBACML_IMPLEMENTATION
+  return acml_pow_k8(x, y);
+#else
+  return pow(x, y);
+#endif
+}
+
+}
+
+#endif  // THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
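An illustrative caller, written as a C++ translation unit since the unguarded extern "C" block above assumes C++; the helper sum_exp is hypothetical:

#include "third_party/open64_libacml_mv/libacml.h"

// Sum of exponentials, e.g. a softmax denominator.  On x86_64 builds
// this compiles to direct calls into the k8 kernels; on other targets
// the identical source falls back to libm's exp().
static double sum_exp(const double* v, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += acml_exp(v[i]);
  return sum;
}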
diff --git a/libacml_portability_test.cc b/libacml_portability_test.cc new file mode 100644 index 0000000..1f62d1a --- /dev/null +++ b/libacml_portability_test.cc
@@ -0,0 +1,16 @@ +#include "testing/base/public/gmock.h" +#include "testing/base/public/gunit.h" +#include "third_party/open64_libacml_mv/libacml.h" + +namespace { + +using ::testing::Eq; + +TEST(LibacmlPortabilityTest, Trivial) { + EXPECT_THAT(acml_exp(0), Eq(1)); + EXPECT_THAT(acml_expf(0), Eq(1)); + EXPECT_THAT(acml_pow(2, 2), Eq(4)); + EXPECT_THAT(acml_log(1), Eq(0)); +} + +} // namespace
diff --git a/src/acos.c b/src/acos.c new file mode 100644 index 0000000..26bac6c --- /dev/null +++ b/src/acos.c
@@ -0,0 +1,183 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_VAL_WITH_FLAGS +#define USE_NAN_WITH_FLAGS +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_NAN_WITH_FLAGS +#undef USE_VAL_WITH_FLAGS +#undef USE_HANDLE_ERROR + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range argument */ +static inline double retval_errno_edom(double x) +{ + struct exception exc; + exc.arg1 = x; + exc.arg2 = x; + exc.name = (char *)"acos"; + exc.type = DOMAIN; + if (_LIB_VERSION == _SVID_) + exc.retval = HUGE; + else + exc.retval = nan_with_flags(AMD_F_INVALID); + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("acos: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#ifdef WINDOWS +#pragma function(acos) +#endif + +double FN_PROTOTYPE(acos)(double x) +{ + /* Computes arccos(x). + The argument is first reduced by noting that arccos(x) + is invalid for abs(x) > 1. For denormal and small + arguments arccos(x) = pi/2 to machine accuracy. + Remaining argument ranges are handled as follows. + For abs(x) <= 0.5 use + arccos(x) = pi/2 - arcsin(x) + = pi/2 - (x + x^3*R(x^2)) + where R(x^2) is a rational minimax approximation to + (arcsin(x) - x)/x^3. + For abs(x) > 0.5 exploit the identity: + arccos(x) = pi - 2*arcsin(sqrt(1-x)/2) + together with the above rational approximation, and + reconstruct the terms carefully. + */ + + /* Some constants and split constants. 
*/ + + static const double + pi = 3.1415926535897933e+00, /* 0x400921fb54442d18 */ + piby2 = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */ + piby2_head = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */ + piby2_tail = 6.12323399573676603587e-17; /* 0x3c91a62633145c07 */ + + double u, y, s=0.0, r; + int xexp, xnan, transform=0; + + unsigned long long ux, aux, xneg; + GET_BITS_DP64(x, ux); + aux = ux & ~SIGNBIT_DP64; + xneg = (ux & SIGNBIT_DP64); + xnan = (aux > PINFBITPATT_DP64); + xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64; + + /* Special cases */ + + if (xnan) + { +#ifdef WINDOWS + return handle_error("acos", ux|0x0008000000000000, _DOMAIN, + 0, EDOM, x, 0.0); +#else + return x + x; /* With invalid if it's a signalling NaN */ +#endif + } + else if (xexp < -56) + { /* y small enough that arccos(x) = pi/2 */ + return val_with_flags(piby2, AMD_F_INEXACT); + } + else if (xexp >= 0) + { /* abs(x) >= 1.0 */ + if (x == 1.0) + return 0.0; + else if (x == -1.0) + return val_with_flags(pi, AMD_F_INEXACT); + else +#ifdef WINDOWS + return handle_error("acos", INDEFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x); +#endif + } + + if (xneg) y = -x; + else y = x; + + transform = (xexp >= -1); /* abs(x) >= 0.5 */ + + if (transform) + { /* Transform y into the range [0,0.5) */ + r = 0.5*(1.0 - y); +#ifdef WINDOWS + /* VC++ intrinsic call */ + _mm_store_sd(&s, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r))); +#else + /* Hammer sqrt instruction */ + asm volatile ("sqrtsd %1, %0" : "=x" (s) : "x" (r)); +#endif + y = s; + } + else + r = y*y; + + /* Use a rational approximation for [0.0, 0.5] */ + + u = r*(0.227485835556935010735943483075 + + (-0.445017216867635649900123110649 + + (0.275558175256937652532686256258 + + (-0.0549989809235685841612020091328 + + (0.00109242697235074662306043804220 + + 0.0000482901920344786991880522822991*r)*r)*r)*r)*r)/ + (1.36491501334161032038194214209 + + (-3.28431505720958658909889444194 + + (2.76568859157270989520376345954 + + (-0.943639137032492685763471240072 + + 0.105869422087204370341222318533*r)*r)*r)*r); + + if (transform) + { /* Reconstruct acos carefully in transformed region */ + if (xneg) return pi - 2.0*(s+(y*u - piby2_tail)); + else + { + double c, s1; + unsigned long long us; + GET_BITS_DP64(s, us); + PUT_BITS_DP64(0xffffffff00000000 & us, s1); + c = (r-s1*s1)/(s+s1); + return 2.0*s1 + (2.0*c+2.0*y*u); + } + } + else + return piby2_head - (x - (piby2_tail - x*u)); +} + +weak_alias (__acos, acos)
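A standalone check, not part of the patch, of the transformed-range identity used above, with the parenthesization written out explicitly: for 0.5 <= x <= 1, arccos(x) = 2*arcsin(sqrt((1-x)/2)), and for negative arguments arccos(x) = pi - arccos(-x). At x = 0.5 both expressions give pi/3:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 0.5;
    printf("%.17g\n", acos(x));                           /* pi/3 ~ 1.0471976 */
    printf("%.17g\n", 2.0 * asin(sqrt((1.0 - x) / 2.0))); /* same value */
    return 0;
}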
diff --git a/src/acosf.c b/src/acosf.c new file mode 100644 index 0000000..4464661 --- /dev/null +++ b/src/acosf.c
@@ -0,0 +1,181 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_VALF_WITH_FLAGS +#define USE_NANF_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_NANF_WITH_FLAGS +#undef USE_VALF_WITH_FLAGS +#undef USE_HANDLE_ERRORF + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range argument */ +static inline float retval_errno_edom(float x) +{ + struct exception exc; + exc.arg1 = (double)x; + exc.arg2 = (double)x; + exc.name = (char *)"acosf"; + exc.type = DOMAIN; + if (_LIB_VERSION == _SVID_) + exc.retval = HUGE; + else + exc.retval = nanf_with_flags(AMD_F_INVALID); + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("acosf: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#ifdef WINDOWS +#pragma function(acosf) +#endif + +float FN_PROTOTYPE(acosf)(float x) +{ + /* Computes arccos(x). + The argument is first reduced by noting that arccos(x) + is invalid for abs(x) > 1. For denormal and small + arguments arccos(x) = pi/2 to machine accuracy. + Remaining argument ranges are handled as follows. + For abs(x) <= 0.5 use + arccos(x) = pi/2 - arcsin(x) + = pi/2 - (x + x^3*R(x^2)) + where R(x^2) is a rational minimax approximation to + (arcsin(x) - x)/x^3. + For abs(x) > 0.5 exploit the identity: + arccos(x) = pi - 2*arcsin(sqrt(1-x)/2) + together with the above rational approximation, and + reconstruct the terms carefully. + */ + + /* Some constants and split constants. 
*/ + + static const float + piby2 = 1.5707963705e+00F; /* 0x3fc90fdb */ + static const double + pi = 3.1415926535897933e+00, /* 0x400921fb54442d18 */ + piby2_head = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */ + piby2_tail = 6.12323399573676603587e-17; /* 0x3c91a62633145c07 */ + + float u, y, s = 0.0F, r; + int xexp, xnan, transform = 0; + + unsigned int ux, aux, xneg; + + GET_BITS_SP32(x, ux); + aux = ux & ~SIGNBIT_SP32; + xneg = (ux & SIGNBIT_SP32); + xnan = (aux > PINFBITPATT_SP32); + xexp = (int)((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32; + + /* Special cases */ + + if (xnan) + { +#ifdef WINDOWS + return handle_errorf("acosf", ux|0x00400000, _DOMAIN, 0, + EDOM, x, 0.0F); +#else + return x + x; /* With invalid if it's a signalling NaN */ +#endif + } + else if (xexp < -26) + /* y small enough that arccos(x) = pi/2 */ + return valf_with_flags(piby2, AMD_F_INEXACT); + else if (xexp >= 0) + { /* abs(x) >= 1.0 */ + if (x == 1.0F) + return 0.0F; + else if (x == -1.0F) + return valf_with_flags((float)pi, AMD_F_INEXACT); + else +#ifdef WINDOWS + return handle_errorf("acosf", INDEFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x); +#endif + } + + if (xneg) y = -x; + else y = x; + + transform = (xexp >= -1); /* abs(x) >= 0.5 */ + + if (transform) + { /* Transform y into the range [0,0.5) */ + r = 0.5F*(1.0F - y); +#ifdef WINDOWS + /* VC++ intrinsic call */ + _mm_store_ss(&s, _mm_sqrt_ss(_mm_load_ss(&r))); +#else + /* Hammer sqrt instruction */ + asm volatile ("sqrtss %1, %0" : "=x" (s) : "x" (r)); +#endif + y = s; + } + else + r = y*y; + + /* Use a rational approximation for [0.0, 0.5] */ + + u=r*(0.184161606965100694821398249421F + + (-0.0565298683201845211985026327361F + + (-0.0133819288943925804214011424456F - + 0.00396137437848476485201154797087F*r)*r)*r)/ + (1.10496961524520294485512696706F - + 0.836411276854206731913362287293F*r); + + if (transform) + { + /* Reconstruct acos carefully in transformed region */ + if (xneg) + return (float)(pi - 2.0*(s+(y*u - piby2_tail))); + else + { + float c, s1; + unsigned int us; + GET_BITS_SP32(s, us); + PUT_BITS_SP32(0xffff0000 & us, s1); + c = (r-s1*s1)/(s+s1); + return 2.0F*s1 + (2.0F*c+2.0F*y*u); + } + } + else + return (float)(piby2_head - (x - (piby2_tail - x*u))); +} + +weak_alias (__acosf, acosf)
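The PUT_BITS_SP32(0xffff0000 & us, s1) step above splits s into an exactly squarable head plus a small tail, so the correction c = (r - s1*s1)/(s + s1) loses no precision to the squaring. A standalone equivalent of that split, illustrative only:

#include <stdio.h>

/* Keep the sign, the exponent and the top 7 mantissa bits, so the head
   has at most 8 significant bits and its square (at most 16 bits) is
   exact in single precision. */
static float split_head(float s)
{
    union { float f; unsigned int i; } u;
    u.f = s;
    u.i &= 0xffff0000u;
    return u.f;
}

int main(void)
{
    float s = 0.70710678f;
    float s1 = split_head(s), c = s - s1;
    printf("s = %a, head = %a, tail = %a\n", (double)s, (double)s1, (double)c);
    return 0;
}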
diff --git a/src/acosh.c b/src/acosh.c new file mode 100644 index 0000000..f1d62c6 --- /dev/null +++ b/src/acosh.c
@@ -0,0 +1,447 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_NAN_WITH_FLAGS +#define USE_HANDLE_ERROR +#define USE_LOG_KERNEL_AMD +#include "../inc/libm_inlines_amd.h" +#undef USE_NAN_WITH_FLAGS +#undef USE_HANDLE_ERROR +#undef USE_LOG_KERNEL_AMD + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range argument */ +static inline double retval_errno_edom(double x) +{ + struct exception exc; + exc.arg1 = x; + exc.arg2 = x; + exc.type = DOMAIN; + exc.name = (char *)"acosh"; + if (_LIB_VERSION == _SVID_) + exc.retval = -HUGE; + else + exc.retval = nan_with_flags(AMD_F_INVALID); + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("acosh: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#undef _FUNCNAME +#define _FUNCNAME "acosh" +double FN_PROTOTYPE(acosh)(double x) +{ + + unsigned long long ux; + double r, rarg, r1, r2; + int xexp; + + static const unsigned long long + recrteps = 0x4196a09e667f3bcd; /* 1/sqrt(eps) = 9.49062656242515593767e+07 */ + /* log2_lead and log2_tail sum to an extra-precise version + of log(2) */ + + static const double + log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */ + log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */ + + + GET_BITS_DP64(x, ux); + + if ((ux & EXPBITS_DP64) == EXPBITS_DP64) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_DP64) + { + /* x is NaN */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is infinity */ + if (ux & SIGNBIT_DP64) + /* x is negative infinity. Return a NaN. */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x); +#endif + else + /* Return positive infinity with no signal */ + return x; + } + } + else if ((ux & SIGNBIT_DP64) || (ux <= 0x3ff0000000000000)) + { + /* x <= 1.0 */ + if (ux == 0x3ff0000000000000) + { + /* x = 1.0; return zero. */ + return 0.0; + } + else + { + /* x is less than 1.0. Return a NaN. 
*/ +#ifdef WINDOWS + return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x); +#endif + } + } + + + if (ux > recrteps) + { + /* Arguments greater than 1/sqrt(epsilon) in magnitude are + approximated by acosh(x) = ln(2) + ln(x) */ + /* log_kernel_amd(x) returns xexp, r1, r2 such that + log(x) = xexp*log(2) + r1 + r2 */ + log_kernel_amd64(x, ux, &xexp, &r1, &r2); + /* Add (xexp+1) * log(2) to z1,z2 to get the result acosh(x). + The computed r1 is not subject to rounding error because + (xexp+1) has at most 10 significant bits, log(2) has 24 significant + bits, and r1 has up to 24 bits; and the exponents of r1 + and r2 differ by at most 6. */ + r1 = ((xexp+1) * log2_lead + r1); + r2 = ((xexp+1) * log2_tail + r2); + return r1 + r2; + } + else if (ux >= 0x4060000000000000) + { + /* 128.0 <= x <= 1/sqrt(epsilon) */ + /* acosh for these arguments is approximated by + acosh(x) = ln(x + sqrt(x*x-1)) */ + rarg = x*x-1.0; + /* Use assembly instruction to compute r = sqrt(rarg); */ + ASMSQRT(rarg,r); + r += x; + GET_BITS_DP64(r, ux); + log_kernel_amd64(r, ux, &xexp, &r1, &r2); + r1 = (xexp * log2_lead + r1); + r2 = (xexp * log2_tail + r2); + return r1 + r2; + } + else + { + /* 1.0 < x <= 128.0 */ + double u1, u2, v1, v2, w1, w2, hx, tx, t, r, s, p1, p2, a1, a2, c1, c2, + poly; + if (ux >= 0x3ff8000000000000) + { + /* 1.5 <= x <= 128.0 */ + /* We use minimax polynomials, + based on Abramowitz and Stegun 4.6.32 series + expansion for acosh(x), with the log(2x) and 1/(2.2.x^2) + terms removed. We compensate for these two terms later. + */ + t = x*x; + if (ux >= 0x4040000000000000) + { + /* [3,2] for 32.0 <= x <= 128.0 */ + poly = + (0.45995704464157438175e-9 + + (-0.89080839823528631030e-9 + + (-0.10370522395596168095e-27 + + 0.35255386405811106347e-32 * t) * t) * t) / + (0.21941191335882074014e-8 + + (-0.10185073058358334569e-7 + + 0.95019562478430648685e-8 * t) * t); + } + else if (ux >= 0x4020000000000000) + { + /* [3,3] for 8.0 <= x <= 32.0 */ + poly = + (-0.54903656589072526589e-10 + + (0.27646792387218569776e-9 + + (-0.26912957240626571979e-9 - + 0.86712268396736384286e-29 * t) * t) * t) / + (-0.24327683788655520643e-9 + + (0.20633757212593175571e-8 + + (-0.45438330985257552677e-8 + + 0.28707154390001678580e-8 * t) * t) * t); + } + else if (ux >= 0x4010000000000000) + { + /* [4,3] for 4.0 <= x <= 8.0 */ + poly = + (-0.20827370596738166108e-6 + + (0.10232136919220422622e-5 + + (-0.98094503424623656701e-6 + + (-0.11615338819596146799e-18 + + 0.44511847799282297160e-21 * t) * t) * t) * t) / + (-0.92579451630913718588e-6 + + (0.76997374707496606639e-5 + + (-0.16727286999128481170e-4 + + 0.10463413698762590251e-4 * t) * t) * t); + } + else if (ux >= 0x4000000000000000) + { + /* [5,5] for 2.0 <= x <= 4.0 */ + poly = + (-0.122195030526902362060e-7 + + (0.157894522814328933143e-6 + + (-0.579951798420930466109e-6 + + (0.803568881125803647331e-6 + + (-0.373906657221148667374e-6 - + 0.317856399083678204443e-21 * t) * t) * t) * t) * t) / + (-0.516260096352477148831e-7 + + (0.894662592315345689981e-6 + + (-0.475662774453078218581e-5 + + (0.107249291567405130310e-4 + + (-0.107871445525891289759e-4 + + 0.398833767702587224253e-5 * t) * t) * t) * t) * t); + } + else if (ux >= 0x3ffc000000000000) + { + /* [5,4] for 1.75 <= x <= 2.0 */ + poly = + (0.1437926821253825186e-3 + + (-0.1034078230246627213e-2 + + (0.2015310005461823437e-2 + + (-0.1159685218876828075e-2 + + (-0.9267353551307245327e-11 + + 0.2880267770324388034e-12 * t) * t) * t) * t) * 
t) / + (0.6305521447028109891e-3 + + (-0.6816525887775002944e-2 + + (0.2228081831550003651e-1 + + (-0.2836886105406603318e-1 + + 0.1236997707206036752e-1 * t) * t) * t) * t); + } + else + { + /* [5,4] for 1.5 <= x <= 1.75 */ + poly = + ( 0.7471936607751750826e-3 + + (-0.4849405284371905506e-2 + + (0.8823068059778393019e-2 + + (-0.4825395461288629075e-2 + + (-0.1001984320956564344e-8 + + 0.4299919281586749374e-10 * t) * t) * t) * t) * t) / + (0.3322359141239411478e-2 + + (-0.3293525930397077675e-1 + + (0.1011351440424239210e0 + + (-0.1227083591622587079e0 + + 0.5147099404383426080e-1 * t) * t) * t) * t); + } + GET_BITS_DP64(x, ux); + log_kernel_amd64(x, ux, &xexp, &r1, &r2); + r1 = ((xexp+1) * log2_lead + r1); + r2 = ((xexp+1) * log2_tail + r2); + /* Now (r1,r2) sum to log(2x). Subtract the term + 1/(2.2.x^2) = 0.25/t, and add poly/t, carefully + to maintain precision. (Note that we add poly/t + rather than poly because of the *x factor used + when generating the minimax polynomial) */ + v2 = (poly-0.25)/t; + r = v2 + r1; + s = ((r1 - r) + v2) + r2; + v1 = r + s; + return v1 + ((r - v1) + s); + } + + /* Here 1.0 <= x <= 1.5. It is hard to maintain accuracy here so + we have to go to great lengths to do so. */ + + /* We compute the value + t = x - 1.0 + sqrt(2.0*(x - 1.0) + (x - 1.0)*(x - 1.0)) + using simulated quad precision. */ + t = x - 1.0; + u1 = t * 2.0; + + /* dekker_mul12(t,t,&v1,&v2); */ + GET_BITS_DP64(t, ux); + ux &= 0xfffffffff8000000; + PUT_BITS_DP64(ux, hx); + tx = t - hx; + v1 = t * t; + v2 = (((hx * hx - v1) + hx * tx) + tx * hx) + tx * tx; + + /* dekker_add2(u1,0.0,v1,v2,&w1,&w2); */ + r = u1 + v1; + s = (((u1 - r) + v1) + v2); + w1 = r + s; + w2 = (r - w1) + s; + + /* dekker_sqrt2(w1,w2,&u1,&u2); */ + ASMSQRT(w1,p1); + GET_BITS_DP64(p1, ux); + ux &= 0xfffffffff8000000; + PUT_BITS_DP64(ux, c1); + c2 = p1 - c1; + a1 = p1 * p1; + a2 = (((c1 * c1 - a1) + c1 * c2) + c2 * c1) + c2 * c2; + p2 = (((w1 - a1) - a2) + w2) * 0.5 / p1; + u1 = p1 + p2; + u2 = (p1 - u1) + p2; + + /* dekker_add2(u1,u2,t,0.0,&v1,&v2); */ + r = u1 + t; + s = (((u1 - r) + t)) + u2; + r1 = r + s; + r2 = (r - r1) + s; + t = r1 + r2; + + /* Check for x close to 1.0. */ + if (x < 1.13) + { + /* Here 1.0 <= x < 1.13 implies r <= 0.656. In this region + we need to take extra care to maintain precision. + We have t = r1 + r2 = (x - 1.0 + sqrt(x*x-1.0)) + to more than basic precision. We use the Taylor series + for log(1+x), with terms after the O(x*x) term + approximated by a [6,6] minimax polynomial. */ + double b1, b2, c1, c2, e1, e2, q1, q2, c, cc, hr1, tr1, hpoly, tpoly, hq1, tq1, hr2, tr2; + poly = + (0.30893760556597282162e-21 + + (0.10513858797132174471e0 + + (0.27834538302122012381e0 + + (0.27223638654807468186e0 + + (0.12038958198848174570e0 + + (0.23357202004546870613e-1 + + (0.15208417992520237648e-2 + + 0.72741030690878441996e-7 * t) * t) * t) * t) * t) * t) * t) / + (0.31541576391396523486e0 + + (0.10715979719991342022e1 + + (0.14311581802952004012e1 + + (0.94928647994421895988e0 + + (0.32396235926176348977e0 + + (0.52566134756985833588e-1 + + 0.30477895574211444963e-2 * t) * t) * t) * t) * t) * t); + + /* Now we can compute the result r = acosh(x) = log1p(t) + using the formula t - 0.5*t*t + poly*t*t. Since t is + represented as r1+r2, the formula becomes + r = r1+r2 - 0.5*(r1+r2)*(r1+r2) + poly*(r1+r2)*(r1+r2). 
+ Expanding out, we get + r = r1 + r2 - (0.5 + poly)*(r1*r1 + 2*r1*r2 + r2*r2) + and ignoring negligible quantities we get + r = r1 + r2 - 0.5*r1*r1 + r1*r2 + poly*t*t + */ + if (x < 1.06) + { + double b, c, e; + b = r1*r2; + c = 0.5*r1*r1; + e = poly*t*t; + /* N.B. the order of additions and subtractions is important */ + r = (((r2 - b) + e) - c) + r1; + return r; + } + else + { + /* For 1.06 <= x <= 1.13 we must evaluate in extended precision + to reach about 1 ulp accuracy (in this range the simple code + above only manages about 1.5 ulp accuracy) */ + + /* Split poly, r1 and r2 into head and tail sections */ + GET_BITS_DP64(poly, ux); + ux &= 0xfffffffff8000000; + PUT_BITS_DP64(ux,hpoly); + tpoly = poly - hpoly; + GET_BITS_DP64(r1,ux); + ux &= 0xfffffffff8000000; + PUT_BITS_DP64(ux,hr1); + tr1 = r1 - hr1; + GET_BITS_DP64(r2, ux); + ux &= 0xfffffffff8000000; + PUT_BITS_DP64(ux,hr2); + tr2 = r2 - hr2; + + /* e = poly*t*t */ + c = poly * r1; + cc = (((hpoly * hr1 - c) + hpoly * tr1) + tpoly * hr1) + tpoly * tr1; + cc = poly * r2 + cc; + q1 = c + cc; + q2 = (c - q1) + cc; + GET_BITS_DP64(q1, ux); + ux &= 0xfffffffff8000000; + PUT_BITS_DP64(ux,hq1); + tq1 = q1 - hq1; + c = q1 * r1; + cc = (((hq1 * hr1 - c) + hq1 * tr1) + tq1 * hr1) + tq1 * tr1; + cc = q1 * r2 + q2 * r1 + cc; + e1 = c + cc; + e2 = (c - e1) + cc; + + /* b = r1*r2 */ + b1 = r1 * r2; + b2 = (((hr1 * hr2 - b1) + hr1 * tr2) + tr1 * hr2) + tr1 * tr2; + + /* c = 0.5*r1*r1 */ + c1 = (0.5*r1) * r1; + c2 = (((0.5*hr1 * hr1 - c1) + 0.5*hr1 * tr1) + 0.5*tr1 * hr1) + 0.5*tr1 * tr1; + + /* v = a + d - b */ + r = r1 - b1; + s = (((r1 - r) - b1) - b2) + r2; + v1 = r + s; + v2 = (r - v1) + s; + + /* w = (a + d - b) - c */ + r = v1 - c1; + s = (((v1 - r) - c1) - c2) + v2; + w1 = r + s; + w2 = (r - w1) + s; + + /* u = ((a + d - b) - c) + e */ + r = w1 + e1; + s = (((w1 - r) + e1) + e2) + w2; + u1 = r + s; + u2 = (r - u1) + s; + + /* The result r = acosh(x) */ + r = u1 + u2; + + return r; + } + } + else + { + /* For arguments 1.13 <= x <= 1.5 the log1p function + is good enough */ + return FN_PROTOTYPE(log1p)(t); + } + } +} + +weak_alias (__acosh, acosh)
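The mask-and-multiply sequences above are inlined copies of Dekker's exact-product step (the source's own comments name it dekker_mul12). A standalone version, assuming the GET_BITS/PUT_BITS macros from libm_util_amd.h are available at the path shown:

#include "third_party/open64_libacml_mv/inc/libm_util_amd.h"  /* path assumed */

/* The mask clears the low 27 mantissa bits, so the head halves carry
   at most 26 significant bits and their products incur essentially no
   rounding; (hi, lo) then represent x*y to nearly twice working
   precision. */
static void dekker_mul12(double x, double y, double *hi, double *lo)
{
    unsigned long long ux, uy;
    double hx, tx, hy, ty;
    GET_BITS_DP64(x, ux);
    ux &= 0xfffffffff8000000;
    PUT_BITS_DP64(ux, hx);
    tx = x - hx;
    GET_BITS_DP64(y, uy);
    uy &= 0xfffffffff8000000;
    PUT_BITS_DP64(uy, hy);
    ty = y - hy;
    *hi = x * y;
    *lo = (((hx * hy - *hi) + hx * ty) + tx * hy) + tx * ty;
}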
diff --git a/src/acoshf.c b/src/acoshf.c new file mode 100644 index 0000000..c96fdb0 --- /dev/null +++ b/src/acoshf.c
@@ -0,0 +1,149 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#include <stdio.h> + +#define USE_NANF_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_NANF_WITH_FLAGS +#undef USE_HANDLE_ERRORF + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range argument */ +static inline float retval_errno_edom(float x) +{ + struct exception exc; + exc.arg1 = (double)x; + exc.arg2 = (double)x; + exc.type = DOMAIN; + exc.name = (char *)"acoshf"; + if (_LIB_VERSION == _SVID_) + exc.retval = -HUGE; + else + exc.retval = nanf_with_flags(AMD_F_INVALID); + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("acoshf: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#undef _FUNCNAME +#define _FUNCNAME "acoshf" +float FN_PROTOTYPE(acoshf)(float x) +{ + + unsigned int ux; + double dx, r, rarg, t; + + static const unsigned int + recrteps = 0x46000000; /* 1/sqrt(eps) = 4.09600000000000000000e+03 */ + + static const double + log2 = 6.93147180559945286227e-01; /* 0x3fe62e42fefa39ef */ + + GET_BITS_SP32(x, ux); + + if ((ux & EXPBITS_SP32) == EXPBITS_SP32) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_SP32) + { + /* x is NaN */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN, + 0, EDOM, x, 0.0F); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is infinity */ + if (ux & SIGNBIT_SP32) + /* x is negative infinity. Return a NaN. */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x); +#endif + else + /* Return positive infinity with no signal */ + return x; + } + } + else if ((ux & SIGNBIT_SP32) || (ux < 0x3f800000)) + { + /* x is less than 1.0. Return a NaN. 
*/ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x); +#endif + } + + dx = x; + + if (ux > recrteps) + { + /* Arguments greater than 1/sqrt(epsilon) in magnitude are + approximated by acoshf(x) = ln(2) + ln(x) */ + r = FN_PROTOTYPE(log)(dx) + log2; + } + else if (ux > 0x40000000) + { + /* 2.0 <= x <= 1/sqrt(epsilon) */ + /* acoshf for these arguments is approximated by + acoshf(x) = ln(x + sqrt(x*x-1)) */ + rarg = dx*dx-1.0; + /* Use assembly instruction to compute r = sqrt(rarg); */ + ASMSQRT(rarg,r); + rarg = r + dx; + r = FN_PROTOTYPE(log)(rarg); + } + else + { + /* sqrt(epsilon) <= x <= 2.0 */ + t = dx - 1.0; + rarg = 2.0*t + t*t; + ASMSQRT(rarg,r); /* r = sqrt(rarg) */ + rarg = t + r; + r = FN_PROTOTYPE(log1p)(rarg); + } + return (float)(r); +} + +weak_alias (__acoshf, acoshf)
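A standalone probe, not part of the patch, showing that the middle-range formula ln(x + sqrt(x*x - 1)) and the near-1 formula log1p(t + sqrt(2t + t*t)) used above agree where the branches meet, for example at x = 2:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 2.0, t = x - 1.0;
    printf("%.9g\n", log(x + sqrt(x * x - 1.0)));       /* middle range */
    printf("%.9g\n", log1p(t + sqrt(2.0 * t + t * t))); /* near-1 range */
    return 0;  /* both print acosh(2) = ln(2 + sqrt(3)) ~ 1.3169579 */
}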
diff --git a/src/asin.c b/src/asin.c new file mode 100644 index 0000000..0314dd8 --- /dev/null +++ b/src/asin.c
@@ -0,0 +1,196 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_VAL_WITH_FLAGS +#define USE_NAN_WITH_FLAGS +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_NAN_WITH_FLAGS +#undef USE_VAL_WITH_FLAGS +#undef USE_HANDLE_ERROR + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range argument */ +static inline double retval_errno_edom(double x) +{ + struct exception exc; + exc.arg1 = x; + exc.arg2 = x; + exc.type = DOMAIN; + exc.name = (char *)"asin"; + if (_LIB_VERSION == _SVID_) + exc.retval = HUGE; + else + exc.retval = nan_with_flags(AMD_F_INVALID); + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("asin: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#ifdef WINDOWS +#pragma function(asin) +#endif + +double FN_PROTOTYPE(asin)(double x) +{ + /* Computes arcsin(x). + The argument is first reduced by noting that arcsin(x) + is invalid for abs(x) > 1 and arcsin(-x) = -arcsin(x). + For denormal and small arguments arcsin(x) = x to machine + accuracy. Remaining argument ranges are handled as follows. + For abs(x) <= 0.5 use + arcsin(x) = x + x^3*R(x^2) + where R(x^2) is a rational minimax approximation to + (arcsin(x) - x)/x^3. + For abs(x) > 0.5 exploit the identity: + arcsin(x) = pi/2 - 2*arcsin(sqrt(1-x)/2) + together with the above rational approximation, and + reconstruct the terms carefully. + */ + + /* Some constants and split constants. 
*/ + + static const double + piby2_tail = 6.1232339957367660e-17, /* 0x3c91a62633145c07 */ + hpiby2_head = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */ + piby2 = 1.5707963267948965e+00; /* 0x3ff921fb54442d18 */ + double u, v, y, s=0.0, r; + int xexp, xnan, transform=0; + + unsigned long long ux, aux, xneg; + GET_BITS_DP64(x, ux); + aux = ux & ~SIGNBIT_DP64; + xneg = (ux & SIGNBIT_DP64); + xnan = (aux > PINFBITPATT_DP64); + xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64; + + /* Special cases */ + + if (xnan) + { +#ifdef WINDOWS + return handle_error("asin", ux|0x0008000000000000, _DOMAIN, + 0, EDOM, x, 0.0); +#else + return x + x; /* With invalid if it's a signalling NaN */ +#endif + } + else if (xexp < -28) + { /* y small enough that arcsin(x) = x */ + return val_with_flags(x, AMD_F_INEXACT); + } + else if (xexp >= 0) + { /* abs(x) >= 1.0 */ + if (x == 1.0) + return val_with_flags(piby2, AMD_F_INEXACT); + else if (x == -1.0) + return val_with_flags(-piby2, AMD_F_INEXACT); + else +#ifdef WINDOWS + return handle_error("asin", INDEFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x); +#endif + } + + if (xneg) y = -x; + else y = x; + + transform = (xexp >= -1); /* abs(x) >= 0.5 */ + + if (transform) + { /* Transform y into the range [0,0.5) */ + r = 0.5*(1.0 - y); +#ifdef WINDOWS + /* VC++ intrinsic call */ + _mm_store_sd(&s, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r))); +#else + /* Hammer sqrt instruction */ + asm volatile ("sqrtsd %1, %0" : "=x" (s) : "x" (r)); +#endif + y = s; + } + else + r = y*y; + + /* Use a rational approximation for [0.0, 0.5] */ + + u = r*(0.227485835556935010735943483075 + + (-0.445017216867635649900123110649 + + (0.275558175256937652532686256258 + + (-0.0549989809235685841612020091328 + + (0.00109242697235074662306043804220 + + 0.0000482901920344786991880522822991*r)*r)*r)*r)*r)/ + (1.36491501334161032038194214209 + + (-3.28431505720958658909889444194 + + (2.76568859157270989520376345954 + + (-0.943639137032492685763471240072 + + 0.105869422087204370341222318533*r)*r)*r)*r); + + if (transform) + { /* Reconstruct asin carefully in transformed region */ + { + double c, s1, p, q; + unsigned long long us; + GET_BITS_DP64(s, us); + PUT_BITS_DP64(0xffffffff00000000 & us, s1); + c = (r-s1*s1)/(s+s1); + p = 2.0*s*u - (piby2_tail-2.0*c); + q = hpiby2_head - 2.0*s1; + v = hpiby2_head - (p-q); + } + } + else + { +#ifdef WINDOWS + /* Use a temporary variable to prevent VC++ rearranging + y + y*u + into + y * (1 + u) + and getting an incorrectly rounded result */ + double tmp; + tmp = y * u; + v = y + tmp; +#else + v = y + y*u; +#endif + } + + if (xneg) return -v; + else return v; +} + +weak_alias (__asin, asin)
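
The transform branch above relies on the identity asin(x) = pi/2 - 2*asin(sqrt((1-x)/2)); a quick standalone check of that identity over [0.5, 1.0] (illustrative only, using libm asin and sqrt rather than the file's kernels):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double piby2 = 1.5707963267948966e+00;
    double x;
    for (x = 0.5; x <= 1.0; x += 0.0625) {
        double s   = sqrt(0.5*(1.0 - x)); /* same r = 0.5*(1-y), s = sqrt(r) */
        double rhs = piby2 - 2.0*asin(s);
        printf("x=%6.4f  asin(x)=%.17g  identity=%.17g\n",
               x, asin(x), rhs);
    }
    return 0;
}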
diff --git a/src/asinf.c b/src/asinf.c new file mode 100644 index 0000000..4b42b01 --- /dev/null +++ b/src/asinf.c
@@ -0,0 +1,192 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h> /* fputs, stderr in retval_errno_edom */
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+    struct exception exc;
+    exc.arg1 = (double)x;
+    exc.arg2 = (double)x;
+    exc.type = DOMAIN;
+    exc.name = (char *)"asinf";
+    if (_LIB_VERSION == _SVID_)
+        exc.retval = HUGE;
+    else
+        exc.retval = nanf_with_flags(AMD_F_INVALID);
+    if (_LIB_VERSION == _POSIX_)
+        __set_errno(EDOM);
+    else if (!matherr(&exc))
+    {
+        if(_LIB_VERSION == _SVID_)
+            (void)fputs("asinf: DOMAIN error\n", stderr);
+        __set_errno(EDOM);
+    }
+    return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(asinf)
+#endif
+
+float FN_PROTOTYPE(asinf)(float x)
+{
+    /* Computes arcsin(x).
+       The argument is first reduced by noting that arcsin(x)
+       is invalid for abs(x) > 1 and arcsin(-x) = -arcsin(x).
+       For denormal and small arguments arcsin(x) = x to machine
+       accuracy. Remaining argument ranges are handled as follows.
+       For abs(x) <= 0.5 use
+         arcsin(x) = x + x^3*R(x^2)
+       where R(x^2) is a rational minimax approximation to
+       (arcsin(x) - x)/x^3.
+       For abs(x) > 0.5 exploit the identity:
+         arcsin(x) = pi/2 - 2*arcsin(sqrt((1-x)/2))
+       together with the above rational approximation, and
+       reconstruct the terms carefully.
+    */
+
+    /* Some constants and split constants.
*/ + + static const float + piby2_tail = 7.5497894159e-08F, /* 0x33a22168 */ + hpiby2_head = 7.8539812565e-01F, /* 0x3f490fda */ + piby2 = 1.5707963705e+00F; /* 0x3fc90fdb */ + float u, v, y, s = 0.0F, r; + int xexp, xnan, transform = 0; + + unsigned int ux, aux, xneg; + GET_BITS_SP32(x, ux); + aux = ux & ~SIGNBIT_SP32; + xneg = (ux & SIGNBIT_SP32); + xnan = (aux > PINFBITPATT_SP32); + xexp = (int)((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32; + + /* Special cases */ + + if (xnan) + { +#ifdef WINDOWS + return handle_errorf("asinf", ux|0x00400000, _DOMAIN, 0, + EDOM, x, 0.0F); +#else + return x + x; /* With invalid if it's a signalling NaN */ +#endif + } + else if (xexp < -14) + /* y small enough that arcsin(x) = x */ + return valf_with_flags(x, AMD_F_INEXACT); + else if (xexp >= 0) + { + /* abs(x) >= 1.0 */ + if (x == 1.0F) + return valf_with_flags(piby2, AMD_F_INEXACT); + else if (x == -1.0F) + return valf_with_flags(-piby2, AMD_F_INEXACT); + else +#ifdef WINDOWS + return handle_errorf("asinf", INDEFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x); +#endif + } + + if (xneg) y = -x; + else y = x; + + transform = (xexp >= -1); /* abs(x) >= 0.5 */ + + if (transform) + { /* Transform y into the range [0,0.5) */ + r = 0.5F*(1.0F - y); +#ifdef WINDOWS + /* VC++ intrinsic call */ + _mm_store_ss(&s, _mm_sqrt_ss(_mm_load_ss(&r))); +#else + /* Hammer sqrt instruction */ + asm volatile ("sqrtss %1, %0" : "=x" (s) : "x" (r)); +#endif + y = s; + } + else + r = y*y; + + /* Use a rational approximation for [0.0, 0.5] */ + + u=r*(0.184161606965100694821398249421F + + (-0.0565298683201845211985026327361F + + (-0.0133819288943925804214011424456F - + 0.00396137437848476485201154797087F*r)*r)*r)/ + (1.10496961524520294485512696706F - + 0.836411276854206731913362287293F*r); + + if (transform) + { + /* Reconstruct asin carefully in transformed region */ + float c, s1, p, q; + unsigned int us; + GET_BITS_SP32(s, us); + PUT_BITS_SP32(0xffff0000 & us, s1); + c = (r-s1*s1)/(s+s1); + p = 2.0F*s*u - (piby2_tail-2.0F*c); + q = hpiby2_head - 2.0F*s1; + v = hpiby2_head - (p-q); + } + else + { +#ifdef WINDOWS + /* Use a temporary variable to prevent VC++ rearranging + y + y*u + into + y * (1 + u) + and getting an incorrectly rounded result */ + float tmp; + tmp = y * u; + v = y + tmp; +#else + v = y + y*u; +#endif + } + + if (xneg) return -v; + else return v; +} + +weak_alias (__asinf, asinf)
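
The reconstruction above splits s into a short head s1 (top bits only, so s1*s1 is exact in float) plus a correction c = (r - s1*s1)/(s + s1), effectively doubling the precision carried for sqrt(r). A standalone sketch of the same trick, with memcpy emulating GET_BITS_SP32/PUT_BITS_SP32 (assumed here to be plain bit copies):

#include <math.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float r = 0.1375f;                  /* any value in (0, 0.25] */
    float s = sqrtf(r);

    uint32_t us;
    float s1, c;
    memcpy(&us, &s, sizeof us);         /* GET_BITS_SP32(s, us)   */
    us &= 0xffff0000u;                  /* keep the high half     */
    memcpy(&s1, &us, sizeof s1);        /* PUT_BITS_SP32(us, s1)  */

    c = (r - s1*s1)/(s + s1);           /* low part of the root   */

    printf("residual s*s - r      = %.3g\n", (double)s*s - r);
    printf("residual (s1+c)^2 - r = %.3g\n",
           ((double)s1 + c)*((double)s1 + c) - r);
    return 0;
}

After the mask, s1 has at most 8 significant bits, so s1*s1 incurs no rounding; the second residual printed is typically orders of magnitude smaller than the first.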
diff --git a/src/asinh.c b/src/asinh.c new file mode 100644 index 0000000..7ecde9c --- /dev/null +++ b/src/asinh.c
@@ -0,0 +1,322 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_HANDLE_ERROR
+#define USE_LOG_KERNEL_AMD
+#define USE_VAL_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#undef USE_LOG_KERNEL_AMD
+#undef USE_VAL_WITH_FLAGS
+
+#undef _FUNCNAME
+#define _FUNCNAME "asinh"
+double FN_PROTOTYPE(asinh)(double x)
+{
+
+    unsigned long long ux, ax, xneg;
+    double absx, r, rarg, t, r1, r2, poly, s, v1, v2;
+    int xexp;
+
+    static const unsigned long long
+        rteps = 0x3e46a09e667f3bcd,    /* sqrt(eps) = 1.05367121277235086670e-08 */
+        recrteps = 0x4196a09e667f3bcd; /* 1/rteps = 9.49062656242515593767e+07 */
+
+    /* log2_lead and log2_tail sum to an extra-precise version
+       of log(2) */
+    static const double
+        log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */
+        log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */
+
+
+    GET_BITS_DP64(x, ux);
+    ax = ux & ~SIGNBIT_DP64;
+    xneg = ux & SIGNBIT_DP64;
+    PUT_BITS_DP64(ax, absx);
+
+    if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+    {
+        /* x is either NaN or infinity */
+        if (ux & MANTBITS_DP64)
+        {
+            /* x is NaN */
+#ifdef WINDOWS
+            return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+                                AMD_F_INVALID, EDOM, x, 0.0);
+#else
+            return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+        }
+        else
+        {
+            /* x is infinity. Return the same infinity. */
+#ifdef WINDOWS
+            if (ux & SIGNBIT_DP64)
+                return handle_error(_FUNCNAME, NINFBITPATT_DP64, _DOMAIN,
+                                    AMD_F_INVALID, EDOM, x, 0.0);
+            else
+                return handle_error(_FUNCNAME, PINFBITPATT_DP64, _DOMAIN,
+                                    AMD_F_INVALID, EDOM, x, 0.0);
+#else
+            return x;
+#endif
+        }
+    }
+    else if (ax < rteps) /* abs(x) < sqrt(epsilon) */
+    {
+        if (ax == 0x0000000000000000)
+        {
+            /* x is +/-zero. Return the same zero.
*/ + return x; + } + else + { + /* Tiny arguments approximated by asinh(x) = x + - avoid slow operations on denormalized numbers */ + return val_with_flags(x,AMD_F_INEXACT); + } + } + + + if (ax <= 0x3ff0000000000000) /* abs(x) <= 1.0 */ + { + /* Arguments less than 1.0 in magnitude are + approximated by [4,4] or [5,4] minimax polynomials + fitted to asinh series 4.6.31 (x < 1) from Abramowitz and Stegun + */ + t = x*x; + if (ax < 0x3fd0000000000000) + { + /* [4,4] for 0 < abs(x) < 0.25 */ + poly = + (-0.12845379283524906084997e0 + + (-0.21060688498409799700819e0 + + (-0.10188951822578188309186e0 + + (-0.13891765817243625541799e-1 - + 0.10324604871728082428024e-3 * t) * t) * t) * t) / + (0.77072275701149440164511e0 + + (0.16104665505597338100747e1 + + (0.11296034614816689554875e1 + + (0.30079351943799465092429e0 + + 0.235224464765951442265117e-1 * t) * t) * t) * t); + } + else if (ax < 0x3fe0000000000000) + { + /* [4,4] for 0.25 <= abs(x) < 0.5 */ + poly = + (-0.12186605129448852495563e0 + + (-0.19777978436593069928318e0 + + (-0.94379072395062374824320e-1 + + (-0.12620141363821680162036e-1 - + 0.903396794842691998748349e-4 * t) * t) * t) * t) / + (0.73119630776696495279434e0 + + (0.15157170446881616648338e1 + + (0.10524909506981282725413e1 + + (0.27663713103600182193817e0 + + 0.21263492900663656707646e-1 * t) * t) * t) * t); + } + else if (ax < 0x3fe8000000000000) + { + /* [4,4] for 0.5 <= abs(x) < 0.75 */ + poly = + (-0.81210026327726247622500e-1 + + (-0.12327355080668808750232e0 + + (-0.53704925162784720405664e-1 + + (-0.63106739048128554465450e-2 - + 0.35326896180771371053534e-4 * t) * t) * t) * t) / + (0.48726015805581794231182e0 + + (0.95890837357081041150936e0 + + (0.62322223426940387752480e0 + + (0.15028684818508081155141e0 + + 0.10302171620320141529445e-1 * t) * t) * t) * t); + } + else + { + /* [5,4] for 0.75 <= abs(x) <= 1.0 */ + poly = + (-0.4638179204422665073e-1 + + (-0.7162729496035415183e-1 + + (-0.3247795155696775148e-1 + + (-0.4225785421291932164e-2 + + (-0.3808984717603160127e-4 + + 0.8023464184964125826e-6 * t) * t) * t) * t) * t) / + (0.2782907534642231184e0 + + (0.5549945896829343308e0 + + (0.3700732511330698879e0 + + (0.9395783438240780722e-1 + + 0.7200057974217143034e-2 * t) * t) * t) * t); + } + return x + x*t*poly; + } + else if (ax < 0x4040000000000000) + { + /* 1.0 <= abs(x) <= 32.0 */ + /* Arguments in this region are approximated by various + minimax polynomials fitted to asinh series 4.6.31 + in Abramowitz and Stegun. 
+    */
+        t = x*x;
+        if (ax >= 0x4020000000000000)
+        {
+            /* [3,3] for 8.0 <= abs(x) <= 32.0 */
+            poly =
+                (-0.538003743384069117e-10 +
+                 (-0.273698654196756169e-9 +
+                  (-0.268129826956403568e-9 -
+                   0.804163374628432850e-29 * t) * t) * t) /
+                (0.238083376363471960e-9 +
+                 (0.203579344621125934e-8 +
+                  (0.450836980450693209e-8 +
+                   0.286005148753497156e-8 * t) * t) * t);
+        }
+        else if (ax >= 0x4010000000000000)
+        {
+            /* [4,3] for 4.0 <= abs(x) <= 8.0 */
+            poly =
+                (-0.178284193496441400e-6 +
+                 (-0.928734186616614974e-6 +
+                  (-0.923318925566302615e-6 +
+                   (-0.776417026702577552e-19 +
+                    0.290845644810826014e-21 * t) * t) * t) * t) /
+                (0.786694697277890964e-6 +
+                 (0.685435665630965488e-5 +
+                  (0.153780175436788329e-4 +
+                   0.984873520613417917e-5 * t) * t) * t);
+
+        }
+        else if (ax >= 0x4000000000000000)
+        {
+            /* [5,4] for 2.0 <= abs(x) <= 4.0 */
+            poly =
+                (-0.209689451648100728e-6 +
+                 (-0.219252358028695992e-5 +
+                  (-0.551641756327550939e-5 +
+                   (-0.382300259826830258e-5 +
+                    (-0.421182121910667329e-17 +
+                     0.492236019998237684e-19 * t) * t) * t) * t) * t) /
+                (0.889178444424237735e-6 +
+                 (0.131152171690011152e-4 +
+                  (0.537955850185616847e-4 +
+                   (0.814966175170941864e-4 +
+                    0.407786943832260752e-4 * t) * t) * t) * t);
+        }
+        else if (ax >= 0x3ff8000000000000)
+        {
+            /* [5,4] for 1.5 <= abs(x) <= 2.0 */
+            poly =
+                (-0.195436610112717345e-4 +
+                 (-0.233315515113382977e-3 +
+                  (-0.645380957611087587e-3 +
+                   (-0.478948863920281252e-3 +
+                    (-0.805234112224091742e-12 +
+                     0.246428598194879283e-13 * t) * t) * t) * t) * t) /
+                (0.822166621698664729e-4 +
+                 (0.135346265620413852e-2 +
+                  (0.602739242861830658e-2 +
+                   (0.972227795510722956e-2 +
+                    0.510878800983771167e-2 * t) * t) * t) * t);
+        }
+        else
+        {
+            /* [5,5] for 1.0 <= abs(x) <= 1.5 */
+            poly =
+                (-0.121224194072430701e-4 +
+                 (-0.273145455834305218e-3 +
+                  (-0.152866982560895737e-2 +
+                   (-0.292231744584913045e-2 +
+                    (-0.174670900236060220e-2 -
+                     0.891754209521081538e-12 * t) * t) * t) * t) * t) /
+                (0.499426632161317606e-4 +
+                 (0.139591210395547054e-2 +
+                  (0.107665231109108629e-1 +
+                   (0.325809818749873406e-1 +
+                    (0.415222526655158363e-1 +
+                     0.186315628774716763e-1 * t) * t) * t) * t) * t);
+        }
+        log_kernel_amd64(absx, ax, &xexp, &r1, &r2);
+        r1 = ((xexp+1) * log2_lead + r1);
+        r2 = ((xexp+1) * log2_tail + r2);
+        /* Now (r1,r2) sum to log(2x). Add the term
+           1/(4*x^2) = 0.25/t, and add poly/t, carefully
+           to maintain precision. (Note that we add poly/t
+           rather than poly because of the *x factor used
+           when generating the minimax polynomial) */
+        v2 = (poly+0.25)/t;
+        r = v2 + r1;
+        s = ((r1 - r) + v2) + r2;
+        v1 = r + s;
+        v2 = (r - v1) + s;
+        r = v1 + v2;
+        if (xneg)
+            return -r;
+        else
+            return r;
+    }
+    else
+    {
+        /* abs(x) > 32.0 */
+        if (ax > recrteps)
+        {
+            /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+               approximated by asinh(x) = ln(2) + ln(abs(x)), with sign of x */
+            /* log_kernel_amd64(x) returns xexp, r1, r2 such that
+               log(x) = xexp*log(2) + r1 + r2 */
+            log_kernel_amd64(absx, ax, &xexp, &r1, &r2);
+            /* Add (xexp+1) * log(2) to (r1,r2) to get the result asinh(x).
+               The computed r1 is not subject to rounding error because
+               (xexp+1) has at most 10 significant bits, log(2) has 24 significant
+               bits, and r1 has up to 24 bits; and the exponents of r1
+               and r2 differ by at most 6.
*/ + r1 = ((xexp+1) * log2_lead + r1); + r2 = ((xexp+1) * log2_tail + r2); + if (xneg) + return -(r1 + r2); + else + return r1 + r2; + } + else + { + rarg = absx*absx+1.0; + /* Arguments such that 32.0 <= abs(x) <= 1/sqrt(epsilon) are + approximated by + asinh(x) = ln(abs(x) + sqrt(x*x+1)) + with the sign of x (see Abramowitz and Stegun 4.6.20) */ + /* Use assembly instruction to compute r = sqrt(rarg); */ + ASMSQRT(rarg,r); + r += absx; + GET_BITS_DP64(r, ax); + log_kernel_amd64(r, ax, &xexp, &r1, &r2); + r1 = (xexp * log2_lead + r1); + r2 = (xexp * log2_tail + r2); + if (xneg) + return -(r1 + r2); + else + return r1 + r2; + } + } +} + +weak_alias (__asinh, asinh)
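
A note on the split log(2) used above: log2_lead keeps only the top 24 significant bits (29 trailing zero bits in the mantissa), so products like (xexp+1)*log2_lead are exact in double for any exponent-sized integer, and all rounding error is confined to the log2_tail term. A standalone check of that exactness claim (assumes long double is wider than double, as on x86; illustrative only):

#include <stdio.h>

static const double log2_lead = 6.93147122859954833984e-01; /* 0x3fe62e42e0000000 */

int main(void)
{
    int n;
    for (n = 1; n <= 2048; n++) {
        double p = n * log2_lead;   /* claim: no rounding here */
        if ((long double)p != (long double)n * (long double)log2_lead) {
            printf("n = %d: n*log2_lead was rounded\n", n);
            return 1;
        }
    }
    printf("n*log2_lead is exact for all n in [1, 2048]\n");
    return 0;
}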
diff --git a/src/asinhf.c b/src/asinhf.c new file mode 100644 index 0000000..f5d3bf9 --- /dev/null +++ b/src/asinhf.c
@@ -0,0 +1,164 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_HANDLE_ERRORF
+#define USE_VALF_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERRORF
+#undef USE_VALF_WITH_FLAGS
+
+#undef _FUNCNAME
+#define _FUNCNAME "asinhf"
+float FN_PROTOTYPE(asinhf)(float x)
+{
+
+    double dx;
+    unsigned int ux, ax, xneg;
+    double absx, r, rarg, t, poly;
+
+    static const unsigned int
+        rteps = 0x39800000,    /* sqrt(eps) = 2.44140625000000000000e-04 */
+        recrteps = 0x46000000; /* cutoff ~1/rteps: 0x46000000 = 8.19200000000000000000e+03 */
+
+    static const double
+        log2 = 6.93147180559945286227e-01; /* 0x3fe62e42fefa39ef */
+
+    GET_BITS_SP32(x, ux);
+    ax = ux & ~SIGNBIT_SP32;
+    xneg = ux & SIGNBIT_SP32;
+
+    if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+    {
+        /* x is either NaN or infinity */
+        if (ux & MANTBITS_SP32)
+        {
+            /* x is NaN */
+#ifdef WINDOWS
+            return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+                                 0, EDOM, x, 0.0F);
+#else
+            return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+        }
+        else
+        {
+            /* x is infinity. Return the same infinity. */
+#ifdef WINDOWS
+            if (ux & SIGNBIT_SP32)
+                return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _DOMAIN,
+                                     AMD_F_INVALID, EDOM, x, 0.0F);
+            else
+                return handle_errorf(_FUNCNAME, PINFBITPATT_SP32, _DOMAIN,
+                                     AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+            return x;
+#endif
+        }
+    }
+    else if (ax < rteps) /* abs(x) < sqrt(epsilon) */
+    {
+        if (ax == 0x00000000)
+        {
+            /* x is +/-zero. Return the same zero.
*/ + return x; + } + else + { + /* Tiny arguments approximated by asinhf(x) = x + - avoid slow operations on denormalized numbers */ + return valf_with_flags(x,AMD_F_INEXACT); + } + } + + dx = x; + if (xneg) + absx = -dx; + else + absx = dx; + + if (ax <= 0x40800000) /* abs(x) <= 4.0 */ + { + /* Arguments less than 4.0 in magnitude are + approximated by [4,4] minimax polynomials + */ + t = dx*dx; + if (ax <= 0x40000000) /* abs(x) <= 2 */ + poly = + (-0.1152965835871758072e-1 + + (-0.1480204186473758321e-1 + + (-0.5063201055468483248e-2 + + (-0.4162727710583425360e-3 - + 0.1177198915954942694e-5 * t) * t) * t) * t) / + (0.6917795026025976739e-1 + + (0.1199423176003939087e+0 + + (0.6582362487198468066e-1 + + (0.1260024978680227945e-1 + + 0.6284381367285534560e-3 * t) * t) * t) * t); + else + poly = + (-0.185462290695578589e-2 + + (-0.113672533502734019e-2 + + (-0.142208387300570402e-3 + + (-0.339546014993079977e-5 - + 0.151054665394480990e-8 * t) * t) * t) * t) / + (0.111486158580024771e-1 + + (0.117782437980439561e-1 + + (0.325903773532674833e-2 + + (0.255902049924065424e-3 + + 0.434150786948890837e-5 * t) * t) * t) * t); + return (float)(dx + dx*t*poly); + } + else + { + /* abs(x) > 4.0 */ + if (ax > recrteps) + { + /* Arguments greater than 1/sqrt(epsilon) in magnitude are + approximated by asinhf(x) = ln(2) + ln(abs(x)), with sign of x */ + r = FN_PROTOTYPE(log)(absx) + log2; + } + else + { + rarg = absx*absx+1.0; + /* Arguments such that 4.0 <= abs(x) <= 1/sqrt(epsilon) are + approximated by + asinhf(x) = ln(abs(x) + sqrt(x*x+1)) + with the sign of x (see Abramowitz and Stegun 4.6.20) */ + /* Use assembly instruction to compute r = sqrt(rarg); */ + ASMSQRT(rarg,r); + r += absx; + r = FN_PROTOTYPE(log)(r); + } + if (xneg) + return (float)(-r); + else + return (float)r; + } +} + +weak_alias (__asinhf, asinhf)
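
asinhf above does all of its arithmetic in double and rounds once at the end; for a float result that makes the rational fits and the log formula effectively exact, so none of the head/tail machinery of asinh.c is needed. A standalone illustration of why the widening matters, using the plain log formula at a small argument (libm calls stand in for the file's kernels; the file itself would take the polynomial path here):

#include <math.h>
#include <stdio.h>

static float via_double(float x)   /* widen once, round once */
{
    double dx = x;
    return (float)log(dx + sqrt(dx*dx + 1.0));
}

static float via_float(float x)    /* naive: rounds at every step */
{
    return logf(x + sqrtf(x*x + 1.0f));
}

int main(void)
{
    float x = 0.001f; /* x + sqrt(x*x+1) is 1 + tiny: float loses digits */
    printf("double path: %.9g\n", via_double(x));
    printf("float  path: %.9g\n", via_float(x));
    printf("reference  : %.9g\n", (float)asinh((double)x));
    return 0;
}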
diff --git a/src/atan.c b/src/atan.c new file mode 100644 index 0000000..3b99df9 --- /dev/null +++ b/src/atan.c
@@ -0,0 +1,173 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h> /* fputs, stderr in retval_errno_edom */
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VAL_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+    struct exception exc;
+    exc.arg1 = x;
+    exc.arg2 = x;
+    exc.name = (char *)"atan";
+    exc.type = DOMAIN;
+    if (_LIB_VERSION == _SVID_)
+        exc.retval = HUGE;
+    else
+        exc.retval = nan_with_flags(AMD_F_INVALID);
+    if (_LIB_VERSION == _POSIX_)
+        __set_errno(EDOM);
+    else if (!matherr(&exc))
+    {
+        if(_LIB_VERSION == _SVID_)
+            (void)fputs("atan: DOMAIN error\n", stderr);
+        __set_errno(EDOM);
+    }
+    return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atan)
+#endif
+
+double FN_PROTOTYPE(atan)(double x)
+{
+
+    /* Some constants and split constants. */
+
+    static double piby2 = 1.5707963267948966e+00; /* 0x3ff921fb54442d18 */
+    double chi, clo, v, s, q, z;
+
+    /* Find properties of argument x. */
+
+    unsigned long long ux, aux, xneg;
+    GET_BITS_DP64(x, ux);
+    aux = ux & ~SIGNBIT_DP64;
+    xneg = (ux != aux);
+
+    if (xneg) v = -x;
+    else v = x;
+
+    /* Argument reduction to range [-7/16,7/16] */
+
+    if (aux < 0x3e50000000000000) /* v < 2.0^(-26) */
+    {
+        /* x is a good approximation to atan(x) and avoids working on
+           intermediate denormal numbers */
+        if (aux == 0x0000000000000000)
+            return x;
+        else
+            return val_with_flags(x, AMD_F_INEXACT);
+    }
+    else if (aux > 0x4003800000000000) /* v > 39./16. */
+    {
+
+        if (aux > PINFBITPATT_DP64)
+        {
+            /* x is NaN */
+#ifdef WINDOWS
+            return handle_error("atan", ux|0x0008000000000000, _DOMAIN, 0,
+                                EDOM, x, 0.0);
+#else
+            return x + x; /* Raise invalid if it's a signalling NaN */
+#endif
+        }
+        else if (aux > 0x4370000000000000)
+        { /* abs(x) > 2^56 => arctan(1/x) is
+             insignificant compared to piby2 */
+            if (xneg)
+                return val_with_flags(-piby2, AMD_F_INEXACT);
+            else
+                return val_with_flags(piby2, AMD_F_INEXACT);
+        }
+
+        x = -1.0/v;
+        /* (chi + clo) = arctan(infinity) */
+        chi = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+        clo = 6.12323399573676480327e-17; /* 0x3c91a62633145c06 */
+    }
+    else if (aux > 0x3ff3000000000000) /* 39./16. > v > 19./16. */
+    {
+        x = (v-1.5)/(1.0+1.5*v);
+        /* (chi + clo) = arctan(1.5) */
+        chi = 9.82793723247329054082e-01; /* 0x3fef730bd281f69b */
+        clo = 1.39033110312309953701e-17; /* 0x3c7007887af0cbbc */
+    }
+    else if (aux > 0x3fe6000000000000) /* 19./16. > v > 11./16. */
+    {
+        x = (v-1.0)/(1.0+v);
+        /* (chi + clo) = arctan(1.)
*/ + chi = 7.85398163397448278999e-01; /* 0x3fe921fb54442d18 */ + clo = 3.06161699786838240164e-17; /* 0x3c81a62633145c06 */ + } + else if (aux > 0x3fdc000000000000) /* 11./16. > v > 7./16. */ + { + x = (2.0*v-1.0)/(2.0+v); + /* (chi + clo) = arctan(0.5) */ + chi = 4.63647609000806093515e-01; /* 0x3fddac670561bb4f */ + clo = 2.26987774529616809294e-17; /* 0x3c7a2b7f222f65e0 */ + } + else /* v < 7./16. */ + { + x = v; + chi = 0.0; + clo = 0.0; + } + + /* Core approximation: Remez(4,4) on [-7/16,7/16] */ + + s = x*x; + q = x*s* + (0.268297920532545909e0 + + (0.447677206805497472e0 + + (0.220638780716667420e0 + + (0.304455919504853031e-1 + + 0.142316903342317766e-3*s)*s)*s)*s)/ + (0.804893761597637733e0 + + (0.182596787737507063e1 + + (0.141254259931958921e1 + + (0.424602594203847109e0 + + 0.389525873944742195e-1*s)*s)*s)*s); + + z = chi - ((q - clo) - x); + + if (xneg) z = -z; + return z; +} + +weak_alias (__atan, atan)
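
The ladder of cases above is one application of the subtraction formula for the arctangent: for a pivot c, atan(v) = atan(c) + atan((v - c)/(1 + c*v)), so a few stored (chi, clo) pairs pull v into [-7/16, 7/16] where the short Remez rational is accurate. A standalone check with the c = 1.5 band (libm atan as reference; illustrative only):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double v = 1.3;                 /* in (19/16, 39/16] => pivot c = 1.5 */
    double c = 1.5;
    double x = (v - c)/(1.0 + c*v); /* reduced argument, |x| < 7/16       */
    printf("reduced x       = %.17g\n", x);
    printf("atan(c)+atan(x) = %.17g\n", atan(c) + atan(x));
    printf("atan(v)         = %.17g\n", atan(v));
    return 0;
}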
diff --git a/src/atan2.c b/src/atan2.c new file mode 100644 index 0000000..6531ee4 --- /dev/null +++ b/src/atan2.c
@@ -0,0 +1,787 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h> /* fputs, stderr in retval_errno_edom */
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_SCALEUPDOUBLE1024
+#define USE_SCALEDOWNDOUBLE
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VAL_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_SCALEUPDOUBLE1024
+#undef USE_SCALEDOWNDOUBLE
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range arguments
+   (only used when _LIB_VERSION is _SVID_) */
+static inline double retval_errno_edom(double x, double y)
+{
+    struct exception exc;
+    exc.arg1 = x;
+    exc.arg2 = y;
+    exc.name = (char *)"atan2";
+    exc.type = DOMAIN;
+    exc.retval = HUGE;
+    if (!matherr(&exc))
+    {
+        (void)fputs("atan2: DOMAIN error\n", stderr);
+        __set_errno(EDOM);
+    }
+    return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atan2)
+#endif
+
+double FN_PROTOTYPE(atan2)(double y, double x)
+{
+    /* Arrays atan_jby256_lead and atan_jby256_tail contain
+       leading and trailing parts respectively of precomputed
+       values of atan(j/256), for j = 16, 17, ..., 256.
+       atan_jby256_lead contains the first 21 bits of precision,
+       and atan_jby256_tail contains a further 53 bits precision.
*/ + + static const double atan_jby256_lead[ 241] = { + 6.24187886714935302734e-02, /* 0x3faff55b00000000 */ + 6.63088560104370117188e-02, /* 0x3fb0f99e00000000 */ + 7.01969265937805175781e-02, /* 0x3fb1f86d00000000 */ + 7.40829110145568847656e-02, /* 0x3fb2f71900000000 */ + 7.79666304588317871094e-02, /* 0x3fb3f59f00000000 */ + 8.18479657173156738281e-02, /* 0x3fb4f3fd00000000 */ + 8.57268571853637695312e-02, /* 0x3fb5f23200000000 */ + 8.96031260490417480469e-02, /* 0x3fb6f03b00000000 */ + 9.34767723083496093750e-02, /* 0x3fb7ee1800000000 */ + 9.73475575447082519531e-02, /* 0x3fb8ebc500000000 */ + 1.01215422153472900391e-01, /* 0x3fb9e94100000000 */ + 1.05080246925354003906e-01, /* 0x3fbae68a00000000 */ + 1.08941912651062011719e-01, /* 0x3fbbe39e00000000 */ + 1.12800359725952148438e-01, /* 0x3fbce07c00000000 */ + 1.16655409336090087891e-01, /* 0x3fbddd2100000000 */ + 1.20507001876831054688e-01, /* 0x3fbed98c00000000 */ + 1.24354958534240722656e-01, /* 0x3fbfd5ba00000000 */ + 1.28199219703674316406e-01, /* 0x3fc068d500000000 */ + 1.32039666175842285156e-01, /* 0x3fc0e6ad00000000 */ + 1.35876297950744628906e-01, /* 0x3fc1646500000000 */ + 1.39708757400512695312e-01, /* 0x3fc1e1fa00000000 */ + 1.43537282943725585938e-01, /* 0x3fc25f6e00000000 */ + 1.47361397743225097656e-01, /* 0x3fc2dcbd00000000 */ + 1.51181221008300781250e-01, /* 0x3fc359e800000000 */ + 1.54996633529663085938e-01, /* 0x3fc3d6ee00000000 */ + 1.58807516098022460938e-01, /* 0x3fc453ce00000000 */ + 1.62613749504089355469e-01, /* 0x3fc4d08700000000 */ + 1.66415214538574218750e-01, /* 0x3fc54d1800000000 */ + 1.70211911201477050781e-01, /* 0x3fc5c98100000000 */ + 1.74003481864929199219e-01, /* 0x3fc645bf00000000 */ + 1.77790164947509765625e-01, /* 0x3fc6c1d400000000 */ + 1.81571602821350097656e-01, /* 0x3fc73dbd00000000 */ + 1.85347914695739746094e-01, /* 0x3fc7b97b00000000 */ + 1.89118742942810058594e-01, /* 0x3fc8350b00000000 */ + 1.92884206771850585938e-01, /* 0x3fc8b06e00000000 */ + 1.96644186973571777344e-01, /* 0x3fc92ba300000000 */ + 2.00398445129394531250e-01, /* 0x3fc9a6a800000000 */ + 2.04147100448608398438e-01, /* 0x3fca217e00000000 */ + 2.07889914512634277344e-01, /* 0x3fca9c2300000000 */ + 2.11626768112182617188e-01, /* 0x3fcb169600000000 */ + 2.15357661247253417969e-01, /* 0x3fcb90d700000000 */ + 2.19082474708557128906e-01, /* 0x3fcc0ae500000000 */ + 2.22801089286804199219e-01, /* 0x3fcc84bf00000000 */ + 2.26513504981994628906e-01, /* 0x3fccfe6500000000 */ + 2.30219483375549316406e-01, /* 0x3fcd77d500000000 */ + 2.33919143676757812500e-01, /* 0x3fcdf11000000000 */ + 2.37612247467041015625e-01, /* 0x3fce6a1400000000 */ + 2.41298794746398925781e-01, /* 0x3fcee2e100000000 */ + 2.44978547096252441406e-01, /* 0x3fcf5b7500000000 */ + 2.48651623725891113281e-01, /* 0x3fcfd3d100000000 */ + 2.52317905426025390625e-01, /* 0x3fd025fa00000000 */ + 2.55977153778076171875e-01, /* 0x3fd061ee00000000 */ + 2.59629487991333007812e-01, /* 0x3fd09dc500000000 */ + 2.63274669647216796875e-01, /* 0x3fd0d97e00000000 */ + 2.66912937164306640625e-01, /* 0x3fd1151a00000000 */ + 2.70543813705444335938e-01, /* 0x3fd1509700000000 */ + 2.74167299270629882812e-01, /* 0x3fd18bf500000000 */ + 2.77783632278442382812e-01, /* 0x3fd1c73500000000 */ + 2.81392335891723632812e-01, /* 0x3fd2025500000000 */ + 2.84993648529052734375e-01, /* 0x3fd23d5600000000 */ + 2.88587331771850585938e-01, /* 0x3fd2783700000000 */ + 2.92173147201538085938e-01, /* 0x3fd2b2f700000000 */ + 2.95751571655273437500e-01, /* 0x3fd2ed9800000000 */ + 2.99322128295898437500e-01, /* 
0x3fd3281800000000 */ + 3.02884817123413085938e-01, /* 0x3fd3627700000000 */ + 3.06439399719238281250e-01, /* 0x3fd39cb400000000 */ + 3.09986352920532226562e-01, /* 0x3fd3d6d100000000 */ + 3.13524961471557617188e-01, /* 0x3fd410cb00000000 */ + 3.17055702209472656250e-01, /* 0x3fd44aa400000000 */ + 3.20578098297119140625e-01, /* 0x3fd4845a00000000 */ + 3.24092388153076171875e-01, /* 0x3fd4bdee00000000 */ + 3.27598333358764648438e-01, /* 0x3fd4f75f00000000 */ + 3.31095933914184570312e-01, /* 0x3fd530ad00000000 */ + 3.34585189819335937500e-01, /* 0x3fd569d800000000 */ + 3.38066101074218750000e-01, /* 0x3fd5a2e000000000 */ + 3.41538190841674804688e-01, /* 0x3fd5dbc300000000 */ + 3.45002174377441406250e-01, /* 0x3fd6148400000000 */ + 3.48457098007202148438e-01, /* 0x3fd64d1f00000000 */ + 3.51903676986694335938e-01, /* 0x3fd6859700000000 */ + 3.55341434478759765625e-01, /* 0x3fd6bdea00000000 */ + 3.58770608901977539062e-01, /* 0x3fd6f61900000000 */ + 3.62190723419189453125e-01, /* 0x3fd72e2200000000 */ + 3.65602254867553710938e-01, /* 0x3fd7660700000000 */ + 3.69004726409912109375e-01, /* 0x3fd79dc600000000 */ + 3.72398376464843750000e-01, /* 0x3fd7d56000000000 */ + 3.75782966613769531250e-01, /* 0x3fd80cd400000000 */ + 3.79158496856689453125e-01, /* 0x3fd8442200000000 */ + 3.82525205612182617188e-01, /* 0x3fd87b4b00000000 */ + 3.85882616043090820312e-01, /* 0x3fd8b24d00000000 */ + 3.89230966567993164062e-01, /* 0x3fd8e92900000000 */ + 3.92570018768310546875e-01, /* 0x3fd91fde00000000 */ + 3.95900011062622070312e-01, /* 0x3fd9566d00000000 */ + 3.99220705032348632812e-01, /* 0x3fd98cd500000000 */ + 4.02532100677490234375e-01, /* 0x3fd9c31600000000 */ + 4.05834197998046875000e-01, /* 0x3fd9f93000000000 */ + 4.09126996994018554688e-01, /* 0x3fda2f2300000000 */ + 4.12410259246826171875e-01, /* 0x3fda64ee00000000 */ + 4.15684223175048828125e-01, /* 0x3fda9a9200000000 */ + 4.18948888778686523438e-01, /* 0x3fdad00f00000000 */ + 4.22204017639160156250e-01, /* 0x3fdb056400000000 */ + 4.25449609756469726562e-01, /* 0x3fdb3a9100000000 */ + 4.28685665130615234375e-01, /* 0x3fdb6f9600000000 */ + 4.31912183761596679688e-01, /* 0x3fdba47300000000 */ + 4.35129165649414062500e-01, /* 0x3fdbd92800000000 */ + 4.38336372375488281250e-01, /* 0x3fdc0db400000000 */ + 4.41534280776977539062e-01, /* 0x3fdc421900000000 */ + 4.44722414016723632812e-01, /* 0x3fdc765500000000 */ + 4.47900772094726562500e-01, /* 0x3fdcaa6800000000 */ + 4.51069593429565429688e-01, /* 0x3fdcde5300000000 */ + 4.54228639602661132812e-01, /* 0x3fdd121500000000 */ + 4.57377910614013671875e-01, /* 0x3fdd45ae00000000 */ + 4.60517644882202148438e-01, /* 0x3fdd791f00000000 */ + 4.63647603988647460938e-01, /* 0x3fddac6700000000 */ + 4.66767549514770507812e-01, /* 0x3fdddf8500000000 */ + 4.69877958297729492188e-01, /* 0x3fde127b00000000 */ + 4.72978591918945312500e-01, /* 0x3fde454800000000 */ + 4.76069211959838867188e-01, /* 0x3fde77eb00000000 */ + 4.79150056838989257812e-01, /* 0x3fdeaa6500000000 */ + 4.82221126556396484375e-01, /* 0x3fdedcb600000000 */ + 4.85282421112060546875e-01, /* 0x3fdf0ede00000000 */ + 4.88333940505981445312e-01, /* 0x3fdf40dd00000000 */ + 4.91375446319580078125e-01, /* 0x3fdf72b200000000 */ + 4.94406938552856445312e-01, /* 0x3fdfa45d00000000 */ + 4.97428894042968750000e-01, /* 0x3fdfd5e000000000 */ + 5.00440597534179687500e-01, /* 0x3fe0039c00000000 */ + 5.03442764282226562500e-01, /* 0x3fe01c3400000000 */ + 5.06434917449951171875e-01, /* 0x3fe034b700000000 */ + 5.09417057037353515625e-01, /* 0x3fe04d2500000000 */ + 
5.12389183044433593750e-01, /* 0x3fe0657e00000000 */ + 5.15351772308349609375e-01, /* 0x3fe07dc300000000 */ + 5.18304347991943359375e-01, /* 0x3fe095f300000000 */ + 5.21246910095214843750e-01, /* 0x3fe0ae0e00000000 */ + 5.24179458618164062500e-01, /* 0x3fe0c61400000000 */ + 5.27101993560791015625e-01, /* 0x3fe0de0500000000 */ + 5.30014991760253906250e-01, /* 0x3fe0f5e200000000 */ + 5.32917976379394531250e-01, /* 0x3fe10daa00000000 */ + 5.35810947418212890625e-01, /* 0x3fe1255d00000000 */ + 5.38693904876708984375e-01, /* 0x3fe13cfb00000000 */ + 5.41567325592041015625e-01, /* 0x3fe1548500000000 */ + 5.44430732727050781250e-01, /* 0x3fe16bfa00000000 */ + 5.47284126281738281250e-01, /* 0x3fe1835a00000000 */ + 5.50127506256103515625e-01, /* 0x3fe19aa500000000 */ + 5.52961349487304687500e-01, /* 0x3fe1b1dc00000000 */ + 5.55785179138183593750e-01, /* 0x3fe1c8fe00000000 */ + 5.58598995208740234375e-01, /* 0x3fe1e00b00000000 */ + 5.61403274536132812500e-01, /* 0x3fe1f70400000000 */ + 5.64197540283203125000e-01, /* 0x3fe20de800000000 */ + 5.66981792449951171875e-01, /* 0x3fe224b700000000 */ + 5.69756031036376953125e-01, /* 0x3fe23b7100000000 */ + 5.72520732879638671875e-01, /* 0x3fe2521700000000 */ + 5.75275897979736328125e-01, /* 0x3fe268a900000000 */ + 5.78021049499511718750e-01, /* 0x3fe27f2600000000 */ + 5.80756187438964843750e-01, /* 0x3fe2958e00000000 */ + 5.83481788635253906250e-01, /* 0x3fe2abe200000000 */ + 5.86197376251220703125e-01, /* 0x3fe2c22100000000 */ + 5.88903427124023437500e-01, /* 0x3fe2d84c00000000 */ + 5.91599464416503906250e-01, /* 0x3fe2ee6200000000 */ + 5.94285964965820312500e-01, /* 0x3fe3046400000000 */ + 5.96962928771972656250e-01, /* 0x3fe31a5200000000 */ + 5.99629878997802734375e-01, /* 0x3fe3302b00000000 */ + 6.02287292480468750000e-01, /* 0x3fe345f000000000 */ + 6.04934692382812500000e-01, /* 0x3fe35ba000000000 */ + 6.07573032379150390625e-01, /* 0x3fe3713d00000000 */ + 6.10201358795166015625e-01, /* 0x3fe386c500000000 */ + 6.12820148468017578125e-01, /* 0x3fe39c3900000000 */ + 6.15428924560546875000e-01, /* 0x3fe3b19800000000 */ + 6.18028640747070312500e-01, /* 0x3fe3c6e400000000 */ + 6.20618820190429687500e-01, /* 0x3fe3dc1c00000000 */ + 6.23198986053466796875e-01, /* 0x3fe3f13f00000000 */ + 6.25770092010498046875e-01, /* 0x3fe4064f00000000 */ + 6.28331184387207031250e-01, /* 0x3fe41b4a00000000 */ + 6.30883216857910156250e-01, /* 0x3fe4303200000000 */ + 6.33425712585449218750e-01, /* 0x3fe4450600000000 */ + 6.35958671569824218750e-01, /* 0x3fe459c600000000 */ + 6.38482093811035156250e-01, /* 0x3fe46e7200000000 */ + 6.40995979309082031250e-01, /* 0x3fe4830a00000000 */ + 6.43500804901123046875e-01, /* 0x3fe4978f00000000 */ + 6.45996093750000000000e-01, /* 0x3fe4ac0000000000 */ + 6.48482322692871093750e-01, /* 0x3fe4c05e00000000 */ + 6.50959014892578125000e-01, /* 0x3fe4d4a800000000 */ + 6.53426170349121093750e-01, /* 0x3fe4e8de00000000 */ + 6.55884265899658203125e-01, /* 0x3fe4fd0100000000 */ + 6.58332824707031250000e-01, /* 0x3fe5111000000000 */ + 6.60772323608398437500e-01, /* 0x3fe5250c00000000 */ + 6.63202762603759765625e-01, /* 0x3fe538f500000000 */ + 6.65623664855957031250e-01, /* 0x3fe54cca00000000 */ + 6.68035984039306640625e-01, /* 0x3fe5608d00000000 */ + 6.70438766479492187500e-01, /* 0x3fe5743c00000000 */ + 6.72832489013671875000e-01, /* 0x3fe587d800000000 */ + 6.75216674804687500000e-01, /* 0x3fe59b6000000000 */ + 6.77592277526855468750e-01, /* 0x3fe5aed600000000 */ + 6.79958820343017578125e-01, /* 0x3fe5c23900000000 */ + 6.82316303253173828125e-01, /* 
0x3fe5d58900000000 */ + 6.84664726257324218750e-01, /* 0x3fe5e8c600000000 */ + 6.87004089355468750000e-01, /* 0x3fe5fbf000000000 */ + 6.89334869384765625000e-01, /* 0x3fe60f0800000000 */ + 6.91656589508056640625e-01, /* 0x3fe6220d00000000 */ + 6.93969249725341796875e-01, /* 0x3fe634ff00000000 */ + 6.96272850036621093750e-01, /* 0x3fe647de00000000 */ + 6.98567867279052734375e-01, /* 0x3fe65aab00000000 */ + 7.00854301452636718750e-01, /* 0x3fe66d6600000000 */ + 7.03131675720214843750e-01, /* 0x3fe6800e00000000 */ + 7.05400466918945312500e-01, /* 0x3fe692a400000000 */ + 7.07660198211669921875e-01, /* 0x3fe6a52700000000 */ + 7.09911346435546875000e-01, /* 0x3fe6b79800000000 */ + 7.12153911590576171875e-01, /* 0x3fe6c9f700000000 */ + 7.14387893676757812500e-01, /* 0x3fe6dc4400000000 */ + 7.16613292694091796875e-01, /* 0x3fe6ee7f00000000 */ + 7.18829631805419921875e-01, /* 0x3fe700a700000000 */ + 7.21037864685058593750e-01, /* 0x3fe712be00000000 */ + 7.23237514495849609375e-01, /* 0x3fe724c300000000 */ + 7.25428581237792968750e-01, /* 0x3fe736b600000000 */ + 7.27611064910888671875e-01, /* 0x3fe7489700000000 */ + 7.29785442352294921875e-01, /* 0x3fe75a6700000000 */ + 7.31950759887695312500e-01, /* 0x3fe76c2400000000 */ + 7.34108448028564453125e-01, /* 0x3fe77dd100000000 */ + 7.36257076263427734375e-01, /* 0x3fe78f6b00000000 */ + 7.38397598266601562500e-01, /* 0x3fe7a0f400000000 */ + 7.40530014038085937500e-01, /* 0x3fe7b26c00000000 */ + 7.42654323577880859375e-01, /* 0x3fe7c3d300000000 */ + 7.44770050048828125000e-01, /* 0x3fe7d52800000000 */ + 7.46877670288085937500e-01, /* 0x3fe7e66c00000000 */ + 7.48976707458496093750e-01, /* 0x3fe7f79e00000000 */ + 7.51068115234375000000e-01, /* 0x3fe808c000000000 */ + 7.53150939941406250000e-01, /* 0x3fe819d000000000 */ + 7.55226135253906250000e-01, /* 0x3fe82ad000000000 */ + 7.57292747497558593750e-01, /* 0x3fe83bbe00000000 */ + 7.59351730346679687500e-01, /* 0x3fe84c9c00000000 */ + 7.61402606964111328125e-01, /* 0x3fe85d6900000000 */ + 7.63445377349853515625e-01, /* 0x3fe86e2500000000 */ + 7.65480041503906250000e-01, /* 0x3fe87ed000000000 */ + 7.67507076263427734375e-01, /* 0x3fe88f6b00000000 */ + 7.69526004791259765625e-01, /* 0x3fe89ff500000000 */ + 7.71537303924560546875e-01, /* 0x3fe8b06f00000000 */ + 7.73540973663330078125e-01, /* 0x3fe8c0d900000000 */ + 7.75536537170410156250e-01, /* 0x3fe8d13200000000 */ + 7.77523994445800781250e-01, /* 0x3fe8e17a00000000 */ + 7.79504299163818359375e-01, /* 0x3fe8f1b300000000 */ + 7.81476497650146484375e-01, /* 0x3fe901db00000000 */ + 7.83441066741943359375e-01, /* 0x3fe911f300000000 */ + 7.85398006439208984375e-01}; /* 0x3fe921fb00000000 */ + + static const double atan_jby256_tail[ 241] = { + 2.13244638182005395671e-08, /* 0x3e56e59fbd38db2c */ + 3.89093864761712760656e-08, /* 0x3e64e3aa54dedf96 */ + 4.44780900009437454576e-08, /* 0x3e67e105ab1bda88 */ + 1.15344768460112754160e-08, /* 0x3e48c5254d013fd0 */ + 3.37271051945395312705e-09, /* 0x3e2cf8ab3ad62670 */ + 2.40857608736109859459e-08, /* 0x3e59dca4bec80468 */ + 1.85853810450623807768e-08, /* 0x3e53f4b5ec98a8da */ + 5.14358299969225078306e-08, /* 0x3e6b9d49619d81fe */ + 8.85023985412952486748e-09, /* 0x3e43017887460934 */ + 1.59425154214358432060e-08, /* 0x3e511e3eca0b9944 */ + 1.95139937737755753164e-08, /* 0x3e54f3f73c5a332e */ + 2.64909755273544319715e-08, /* 0x3e5c71c8ae0e00a6 */ + 4.43388037881231070144e-08, /* 0x3e67cde0f86fbdc7 */ + 2.14757072421821274557e-08, /* 0x3e570f328c889c72 */ + 2.61049792670754218852e-08, /* 0x3e5c07ae9b994efe */ + 
7.81439350674466302231e-09, /* 0x3e40c8021d7b1698 */ + 3.60125207123751024094e-08, /* 0x3e635585edb8cb22 */ + 6.15276238179343767917e-08, /* 0x3e70842567b30e96 */ + 9.54387964641184285058e-08, /* 0x3e799e811031472e */ + 3.02789566851502754129e-08, /* 0x3e6041821416bcee */ + 1.16888650949870856331e-07, /* 0x3e7f6086e4dc96f4 */ + 1.07580956468653338863e-08, /* 0x3e471a535c5f1b58 */ + 8.33454265379535427653e-08, /* 0x3e765f743fe63ca1 */ + 1.10790279272629526068e-07, /* 0x3e7dbd733472d014 */ + 1.08394277896366207424e-07, /* 0x3e7d18cc4d8b0d1d */ + 9.22176086126841098800e-08, /* 0x3e78c12553c8fb29 */ + 7.90938592199048786990e-08, /* 0x3e753b49e2e8f991 */ + 8.66445407164293125637e-08, /* 0x3e77422ae148c141 */ + 1.40839973537092438671e-08, /* 0x3e4e3ec269df56a8 */ + 1.19070438507307600689e-07, /* 0x3e7ff6754e7e0ac9 */ + 6.40451663051716197071e-08, /* 0x3e7131267b1b5aad */ + 1.08338682076343674522e-07, /* 0x3e7d14fa403a94bc */ + 3.52999550187922736222e-08, /* 0x3e62f396c089a3d8 */ + 1.05983273930043077202e-07, /* 0x3e7c731d78fa95bb */ + 1.05486124078259553339e-07, /* 0x3e7c50f385177399 */ + 5.82167732281776477773e-08, /* 0x3e6f41409c6f2c20 */ + 1.08696483983403942633e-07, /* 0x3e7d2d90c4c39ec0 */ + 4.47335086122377542835e-08, /* 0x3e680420696f2106 */ + 1.26896287162615723528e-08, /* 0x3e4b40327943a2e8 */ + 4.06534471589151404531e-08, /* 0x3e65d35e02f3d2a2 */ + 3.84504846300557026690e-08, /* 0x3e64a498288117b0 */ + 3.60715006404807269080e-08, /* 0x3e635da119afb324 */ + 6.44725903165522722801e-08, /* 0x3e714e85cdb9a908 */ + 3.63749249976409461305e-08, /* 0x3e638754e5547b9a */ + 1.03901294413833913794e-07, /* 0x3e7be40ae6ce3246 */ + 6.25379756302167880580e-08, /* 0x3e70c993b3bea7e7 */ + 6.63984302368488828029e-08, /* 0x3e71d2dd89ac3359 */ + 3.21844598971548278059e-08, /* 0x3e61476603332c46 */ + 1.16030611712765830905e-07, /* 0x3e7f25901bac55b7 */ + 1.17464622142347730134e-07, /* 0x3e7f881b7c826e28 */ + 7.54604017965808996596e-08, /* 0x3e7441996d698d20 */ + 1.49234929356206556899e-07, /* 0x3e8407ac521ea089 */ + 1.41416924523217430259e-07, /* 0x3e82fb0c6c4b1723 */ + 2.13308065617483489011e-07, /* 0x3e8ca135966a3e18 */ + 5.04230937933302320146e-08, /* 0x3e6b1218e4d646e4 */ + 5.45874922281655519035e-08, /* 0x3e6d4e72a350d288 */ + 1.51849028914786868886e-07, /* 0x3e84617e2f04c329 */ + 3.09004308703769273010e-08, /* 0x3e6096ec41e82650 */ + 9.67574548184738317664e-08, /* 0x3e79f91f25773e6e */ + 4.02508285529322212824e-08, /* 0x3e659c0820f1d674 */ + 3.01222268096861091157e-08, /* 0x3e602bf7a2df1064 */ + 2.36189860670079288680e-07, /* 0x3e8fb36bfc40508f */ + 1.14095158111080887695e-07, /* 0x3e7ea08f3f8dc892 */ + 7.42349089746573467487e-08, /* 0x3e73ed6254656a0e */ + 5.12515583196230380184e-08, /* 0x3e6b83f5e5e69c58 */ + 2.19290391828763918102e-07, /* 0x3e8d6ec2af768592 */ + 3.83263512187553886471e-08, /* 0x3e6493889a226f94 */ + 1.61513486284090523855e-07, /* 0x3e85ad8fa65279ba */ + 5.09996743535589922261e-08, /* 0x3e6b615784d45434 */ + 1.23694037861246766534e-07, /* 0x3e809a184368f145 */ + 8.23367955351123783984e-08, /* 0x3e761a2439b0d91c */ + 1.07591766213053694014e-07, /* 0x3e7ce1a65e39a978 */ + 1.42789947524631815640e-07, /* 0x3e832a39a93b6a66 */ + 1.32347123024711878538e-07, /* 0x3e81c3699af804e7 */ + 2.17626067316598149229e-08, /* 0x3e575e0f4e44ede8 */ + 2.34454866923044288656e-07, /* 0x3e8f77ced1a7a83b */ + 2.82966370261766916053e-09, /* 0x3e284e7f0cb1b500 */ + 2.29300919890907632975e-07, /* 0x3e8ec6b838b02dfe */ + 1.48428270450261284915e-07, /* 0x3e83ebf4dfbeda87 */ + 1.87937408574313982512e-07, /* 
0x3e89397aed9cb475 */ + 6.13685946813334055347e-08, /* 0x3e707937bc239c54 */ + 1.98585022733583817493e-07, /* 0x3e8aa754553131b6 */ + 7.68394131623752961662e-08, /* 0x3e74a05d407c45dc */ + 1.28119052312436745644e-07, /* 0x3e8132231a206dd0 */ + 7.02119104719236502733e-08, /* 0x3e72d8ecfdd69c88 */ + 9.87954793820636301943e-08, /* 0x3e7a852c74218606 */ + 1.72176752381034986217e-07, /* 0x3e871bf2baeebb50 */ + 1.12877225146169704119e-08, /* 0x3e483d7db7491820 */ + 5.33549829555851737993e-08, /* 0x3e6ca50d92b6da14 */ + 2.13833275710816521345e-08, /* 0x3e56f5cde8530298 */ + 1.16243518048290556393e-07, /* 0x3e7f343198910740 */ + 6.29926408369055877943e-08, /* 0x3e70e8d241ccd80a */ + 6.45429039328021963791e-08, /* 0x3e71535ac619e6c8 */ + 8.64001922814281933403e-08, /* 0x3e77316041c36cd2 */ + 9.50767572202325800240e-08, /* 0x3e7985a000637d8e */ + 5.80851497508121135975e-08, /* 0x3e6f2f29858c0a68 */ + 1.82350561135024766232e-07, /* 0x3e8879847f96d909 */ + 1.98948680587390608655e-07, /* 0x3e8ab3d319e12e42 */ + 7.83548663450197659846e-08, /* 0x3e75088162dfc4c2 */ + 3.04374234486798594427e-08, /* 0x3e605749a1cd9d8c */ + 2.76135725629797411787e-08, /* 0x3e5da65c6c6b8618 */ + 4.32610105454203065470e-08, /* 0x3e6739bf7df1ad64 */ + 5.17107515324127256994e-08, /* 0x3e6bc31252aa3340 */ + 2.82398327875841444660e-08, /* 0x3e5e528191ad3aa8 */ + 1.87482469524195595399e-07, /* 0x3e8929d93df19f18 */ + 2.97481891662714096139e-08, /* 0x3e5ff11eb693a080 */ + 9.94421570843584316402e-09, /* 0x3e455ae3f145a3a0 */ + 1.07056210730391848428e-07, /* 0x3e7cbcd8c6c0ca82 */ + 6.25589580466881163081e-08, /* 0x3e70cb04d425d304 */ + 9.56641013869464593803e-08, /* 0x3e79adfcab5be678 */ + 1.88056307148355440276e-07, /* 0x3e893d90c5662508 */ + 8.38850689379557880950e-08, /* 0x3e768489bd35ff40 */ + 5.01215865527674122924e-09, /* 0x3e3586ed3da2b7e0 */ + 1.74166095998522089762e-07, /* 0x3e87604d2e850eee */ + 9.96779574395363585849e-08, /* 0x3e7ac1d12bfb53d8 */ + 5.98432026368321460686e-09, /* 0x3e39b3d468274740 */ + 1.18362922366887577169e-07, /* 0x3e7fc5d68d10e53c */ + 1.86086833284154215946e-07, /* 0x3e88f9e51884becb */ + 1.97671457251348941011e-07, /* 0x3e8a87f0869c06d1 */ + 1.42447160717199237159e-07, /* 0x3e831e7279f685fa */ + 1.05504240785546574184e-08, /* 0x3e46a8282f9719b0 */ + 3.13335218371639189324e-08, /* 0x3e60d2724a8a44e0 */ + 1.96518418901914535399e-07, /* 0x3e8a60524b11ad4e */ + 2.17692035039173536059e-08, /* 0x3e575fdf832750f0 */ + 2.15613114426529981675e-07, /* 0x3e8cf06902e4cd36 */ + 5.68271098300441214948e-08, /* 0x3e6e82422d4f6d10 */ + 1.70331455823369124256e-08, /* 0x3e524a091063e6c0 */ + 9.17590028095709583247e-08, /* 0x3e78a1a172dc6f38 */ + 2.77266304112916566247e-07, /* 0x3e929b6619f8a92d */ + 9.37041937614656939690e-08, /* 0x3e79274d9c1b70c8 */ + 1.56116346368316796511e-08, /* 0x3e50c34b1fbb7930 */ + 4.13967433808382727413e-08, /* 0x3e6639866c20eb50 */ + 1.70164749185821616276e-07, /* 0x3e86d6d0f6832e9e */ + 4.01708788545600086008e-07, /* 0x3e9af54def99f25e */ + 2.59663539226050551563e-07, /* 0x3e916cfc52a00262 */ + 2.22007487655027469542e-07, /* 0x3e8dcc1e83569c32 */ + 2.90542250809644081369e-07, /* 0x3e937f7a551ed425 */ + 4.67720537666628903341e-07, /* 0x3e9f6360adc98887 */ + 2.79799803956772554802e-07, /* 0x3e92c6ec8d35a2c1 */ + 2.07344552327432547723e-07, /* 0x3e8bd44df84cb036 */ + 2.54705698692735196368e-07, /* 0x3e9117cf826e310e */ + 4.26848589539548450728e-07, /* 0x3e9ca533f332cfc9 */ + 2.52506723633552216197e-07, /* 0x3e90f208509dbc2e */ + 2.14684129933849704964e-07, /* 0x3e8cd07d93c945de */ + 
3.20134822201596505431e-07, /* 0x3e957bdfd67e6d72 */ + 9.93537565749855712134e-08, /* 0x3e7aab89c516c658 */ + 3.70792944827917252327e-08, /* 0x3e63e823b1a1b8a0 */ + 1.41772749369083698972e-07, /* 0x3e8307464a9d6d3c */ + 4.22446601490198804306e-07, /* 0x3e9c5993cd438843 */ + 4.11818433724801511540e-07, /* 0x3e9ba2fca02ab554 */ + 1.19976381502605310519e-07, /* 0x3e801a5b6983a268 */ + 3.43703078571520905265e-08, /* 0x3e6273d1b350efc8 */ + 1.66128705555453270379e-07, /* 0x3e864c238c37b0c6 */ + 5.00499610023283006540e-08, /* 0x3e6aded07370a300 */ + 1.75105139941208062123e-07, /* 0x3e878091197eb47e */ + 7.70807146729030327334e-08, /* 0x3e74b0f245e0dabc */ + 2.45918607526895836121e-07, /* 0x3e9080d9794e2eaf */ + 2.18359020958626199345e-07, /* 0x3e8d4ec242b60c76 */ + 8.44342887976445333569e-09, /* 0x3e4221d2f940caa0 */ + 1.07506148687888629299e-07, /* 0x3e7cdbc42b2bba5c */ + 5.36544954316820904572e-08, /* 0x3e6cce37bb440840 */ + 3.39109101518396596341e-07, /* 0x3e96c1d999cf1dd0 */ + 2.60098720293920613340e-08, /* 0x3e5bed8a07eb0870 */ + 8.42678991664621455827e-08, /* 0x3e769ed88f490e3c */ + 5.36972237470183633197e-08, /* 0x3e6cd41719b73ef0 */ + 4.28192558171921681288e-07, /* 0x3e9cbc4ac95b41b7 */ + 2.71535491483955143294e-07, /* 0x3e9238f1b890f5d7 */ + 7.84094998145075780203e-08, /* 0x3e750c4282259cc4 */ + 3.43880599134117431863e-07, /* 0x3e9713d2de87b3e2 */ + 1.32878065060366481043e-07, /* 0x3e81d5a7d2255276 */ + 4.18046802627967629428e-07, /* 0x3e9c0dfd48227ac1 */ + 2.65042411765766019424e-07, /* 0x3e91c964dab76753 */ + 1.70383695347518643694e-07, /* 0x3e86de56d5704496 */ + 1.54096497259613515678e-07, /* 0x3e84aeb71fd19968 */ + 2.36543402412459813461e-07, /* 0x3e8fbf91c57b1918 */ + 4.38416350106876736790e-07, /* 0x3e9d6bef7fbe5d9a */ + 3.03892161339927775731e-07, /* 0x3e9464d3dc249066 */ + 3.31136771605664899240e-07, /* 0x3e9638e2ec4d9073 */ + 6.49494294526590682218e-08, /* 0x3e716f4a7247ea7c */ + 4.10423429887181345747e-09, /* 0x3e31a0a740f1d440 */ + 1.70831640869113847224e-07, /* 0x3e86edbb0114a33c */ + 1.10811512657909180966e-07, /* 0x3e7dbee8bf1d513c */ + 3.23677724749783611964e-07, /* 0x3e95b8bdb0248f73 */ + 3.55662734259192678528e-07, /* 0x3e97de3d3f5eac64 */ + 2.30102333489738219140e-07, /* 0x3e8ee24187ae448a */ + 4.47429004000738629714e-07, /* 0x3e9e06c591ec5192 */ + 7.78167135617329598659e-08, /* 0x3e74e3861a332738 */ + 9.90345291908535415737e-08, /* 0x3e7a9599dcc2bfe4 */ + 5.85800913143113728314e-08, /* 0x3e6f732fbad43468 */ + 4.57859062410871843857e-07, /* 0x3e9eb9f573b727d9 */ + 3.67993069723390929794e-07, /* 0x3e98b212a2eb9897 */ + 2.90836464322977276043e-07, /* 0x3e9384884c167215 */ + 2.51621574250131388318e-07, /* 0x3e90e2d363020051 */ + 2.75789824740652815545e-07, /* 0x3e92820879fbd022 */ + 3.88985776250314403593e-07, /* 0x3e9a1ab9893e4b30 */ + 1.40214080183768019611e-07, /* 0x3e82d1b817a24478 */ + 3.23451432223550478373e-08, /* 0x3e615d7b8ded4878 */ + 9.15979180730608444470e-08, /* 0x3e78968f9db3a5e4 */ + 3.44371402498640470421e-07, /* 0x3e971c4171fe135f */ + 3.40401897215059498077e-07, /* 0x3e96d80f605d0d8c */ + 1.06431813453707950243e-07, /* 0x3e7c91f043691590 */ + 1.46204238932338846248e-07, /* 0x3e839f8a15fce2b2 */ + 9.94610376972039046878e-09, /* 0x3e455beda9d94b80 */ + 2.01711528092681771039e-07, /* 0x3e8b12c15d60949a */ + 2.72027977986191568296e-07, /* 0x3e924167b312bfe3 */ + 2.48402602511693757964e-07, /* 0x3e90ab8633070277 */ + 1.58480011219249621715e-07, /* 0x3e854554ebbc80ee */ + 3.00372828113368713281e-08, /* 0x3e60204aef5a4bb8 */ + 3.67816204583541976394e-07, /* 
0x3e98af08c679cf2c */ + 2.46169793032343824291e-07, /* 0x3e90852a330ae6c8 */ + 1.70080468270204253247e-07, /* 0x3e86d3eb9ec32916 */ + 1.67806717763872914315e-07, /* 0x3e8685cb7fcbbafe */ + 2.67715622006907942620e-07, /* 0x3e91f751c1e0bd95 */ + 2.14411342550299170574e-08, /* 0x3e5705b1b0f72560 */ + 4.11228221283669073277e-07, /* 0x3e9b98d8d808ca92 */ + 3.52311752396749662260e-08, /* 0x3e62ea22c75cc980 */ + 3.52718000397367821054e-07, /* 0x3e97aba62bca0350 */ + 4.38857387992911129814e-07, /* 0x3e9d73833442278c */ + 3.22574606753482540743e-07, /* 0x3e95a5ca1fb18bf9 */ + 3.28730371182804296828e-08, /* 0x3e61a6092b6ecf28 */ + 7.56672470607639279700e-08, /* 0x3e744fd049aac104 */ + 3.26750155316369681821e-09, /* 0x3e2c114fd8df5180 */ + 3.21724445362095284743e-07, /* 0x3e95972f130feae5 */ + 1.06639427371776571151e-07, /* 0x3e7ca034a55fe198 */ + 3.41020788139524715063e-07, /* 0x3e96e2b149990227 */ + 1.00582838631232552824e-07, /* 0x3e7b00000294592c */ + 3.68439433859276640065e-07, /* 0x3e98b9bdc442620e */ + 2.20403078342388012027e-07, /* 0x3e8d94fdfabf3e4e */ + 1.62841467098298142534e-07, /* 0x3e85db30b145ad9a */ + 2.25325348296680733838e-07, /* 0x3e8e3e1eb95022b0 */ + 4.37462238226421614339e-07, /* 0x3e9d5b8b45442bd6 */ + 3.52055880555040706500e-07, /* 0x3e97a046231ecd2e */ + 4.75614398494781776825e-07, /* 0x3e9feafe3ef55232 */ + 3.60998399033215317516e-07, /* 0x3e9839e7bfd78267 */ + 3.79292434611513945954e-08, /* 0x3e645cf49d6fa900 */ + 1.29859015528549300061e-08, /* 0x3e4be3132b27f380 */ + 3.15927546985474913188e-07, /* 0x3e9533980bb84f9f */ + 2.28533679887379668031e-08, /* 0x3e5889e2ce3ba390 */ + 1.17222541823553133877e-07, /* 0x3e7f7778c3ad0cc8 */ + 1.51991208405464415857e-07, /* 0x3e846660cec4eba2 */ + 1.56958239325240655564e-07}; /* 0x3e85110b4611a626 */ + + /* Some constants and split constants. */ + + static double pi = 3.1415926535897932e+00, /* 0x400921fb54442d18 */ + piby2 = 1.5707963267948966e+00, /* 0x3ff921fb54442d18 */ + piby4 = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */ + three_piby4 = 2.3561944901923449e+00, /* 0x4002d97c7f3321d2 */ + pi_head = 3.1415926218032836e+00, /* 0x400921fb50000000 */ + pi_tail = 3.1786509547056392e-08, /* 0x3e6110b4611a6263 */ + piby2_head = 1.5707963267948965e+00, /* 0x3ff921fb54442d18 */ + piby2_tail = 6.1232339957367660e-17; /* 0x3c91a62633145c07 */ + + double u, v, vbyu, q1, q2, s, u1, vu1, u2, vu2, uu, c, r; + unsigned int swap_vu, index, xzero, yzero, xnan, ynan, xinf, yinf; + int m, xexp, yexp, diffexp; + + /* Find properties of arguments x and y. 
*/
+
+    unsigned long long ux, ui, aux, xneg, uy, auy, yneg;
+
+    GET_BITS_DP64(x, ux);
+    GET_BITS_DP64(y, uy);
+    aux = ux & ~SIGNBIT_DP64;
+    auy = uy & ~SIGNBIT_DP64;
+    xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+    yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+    xneg = ux & SIGNBIT_DP64;
+    yneg = uy & SIGNBIT_DP64;
+    xzero = (aux == 0);
+    yzero = (auy == 0);
+    xnan = (aux > PINFBITPATT_DP64);
+    ynan = (auy > PINFBITPATT_DP64);
+    xinf = (aux == PINFBITPATT_DP64);
+    yinf = (auy == PINFBITPATT_DP64);
+
+    diffexp = yexp - xexp;
+
+    /* Special cases */
+
+    if (xnan)
+#ifdef WINDOWS
+        return handle_error("atan2", ux|0x0008000000000000, _DOMAIN, 0,
+                            EDOM, x, y);
+#else
+        return x + x; /* Raise invalid if it's a signalling NaN */
+#endif
+    else if (ynan)
+#ifdef WINDOWS
+        return handle_error("atan2", uy|0x0008000000000000, _DOMAIN, 0,
+                            EDOM, x, y);
+#else
+        return y + y; /* Raise invalid if it's a signalling NaN */
+#endif
+    else if (yzero)
+    { /* Zero y gives +-0 for positive x
+         and +-pi for negative x */
+#ifndef WINDOWS
+        if ((_LIB_VERSION == _SVID_) && xzero)
+            /* Sigh - _SVID_ defines atan2(0,0) as a domain error */
+            return retval_errno_edom(x, y);
+        else
+#endif
+        if (xneg)
+        {
+            if (yneg) return val_with_flags(-pi,AMD_F_INEXACT);
+            else return val_with_flags(pi,AMD_F_INEXACT);
+        }
+        else return y;
+    }
+    else if (xzero)
+    { /* Zero x gives +- pi/2
+         depending on sign of y */
+        if (yneg) return val_with_flags(-piby2,AMD_F_INEXACT);
+        else return val_with_flags(piby2,AMD_F_INEXACT);
+    }
+
+    /* Scale up both x and y if they are both below 1/4.
+       This avoids any possible later denormalised arithmetic. */
+
+    if ((xexp < 1021 && yexp < 1021))
+    {
+        scaleUpDouble1024(ux, &ux);
+        scaleUpDouble1024(uy, &uy);
+        PUT_BITS_DP64(ux, x);
+        PUT_BITS_DP64(uy, y);
+        xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+        yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+        diffexp = yexp - xexp;
+    }
+
+    if (diffexp > 56)
+    { /* abs(y)/abs(x) > 2^56 => arctan(x/y)
+         is insignificant compared to piby2 */
+        if (yneg) return val_with_flags(-piby2,AMD_F_INEXACT);
+        else return val_with_flags(piby2,AMD_F_INEXACT);
+    }
+    else if (diffexp < -28 && (!xneg))
+    { /* x positive and dominant over y by a factor of 2^28.
+         In this case atan(y/x) is y/x to machine accuracy. */
+
+        if (diffexp < -1074) /* Result underflows */
+        {
+            if (yneg)
+                return val_with_flags(-0.0,AMD_F_INEXACT | AMD_F_UNDERFLOW);
+            else
+                return val_with_flags(0.0,AMD_F_INEXACT | AMD_F_UNDERFLOW);
+        }
+        else
+        {
+            if (diffexp < -1022)
+            {
+                /* Result will likely be denormalized */
+                y = scaleDouble_1(y, 100);
+                y /= x;
+                /* Now y is 2^100 times the true result. Scale it back down. */
+                GET_BITS_DP64(y, uy);
+                scaleDownDouble(uy, 100, &uy);
+                PUT_BITS_DP64(uy, y);
+                if ((uy & EXPBITS_DP64) == 0)
+                    return val_with_flags(y, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+                else
+                    return y;
+            }
+            else
+                return y / x;
+        }
+    }
+    else if (diffexp < -56 && xneg)
+    { /* abs(x)/abs(y) > 2^56 and x < 0 => arctan(y/x)
+         is insignificant compared to pi */
+        if (yneg) return val_with_flags(-pi,AMD_F_INEXACT);
+        else return val_with_flags(pi,AMD_F_INEXACT);
+    }
+    else if (yinf && xinf)
+    { /* If abs(x) and abs(y) are both infinity
+         return +-pi/4 or +- 3pi/4 according to
+         signs.
*/ + if (xneg) + { + if (yneg) return val_with_flags(-three_piby4,AMD_F_INEXACT); + else return val_with_flags(three_piby4,AMD_F_INEXACT); + } + else + { + if (yneg) return val_with_flags(-piby4,AMD_F_INEXACT); + else return val_with_flags(piby4,AMD_F_INEXACT); + } + } + + /* General case: take absolute values of arguments */ + + u = x; v = y; + if (xneg) u = -x; + if (yneg) v = -y; + + /* Swap u and v if necessary to obtain 0 < v < u. Compute v/u. */ + + swap_vu = (u < v); + if (swap_vu) { uu = u; u = v; v = uu; } + vbyu = v/u; + + if (vbyu > 0.0625) + { /* General values of v/u. Use a look-up + table and series expansion. */ + + index = (int)(256*vbyu + 0.5); + q1 = atan_jby256_lead[index-16]; + q2 = atan_jby256_tail[index-16]; + c = index*1./256; + GET_BITS_DP64(u, ui); + m = (int)((ui & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64; + u = scaleDouble_2(u,-m); + v = scaleDouble_2(v,-m); + GET_BITS_DP64(u, ui); + PUT_BITS_DP64(0xfffffffff8000000 & ui, u1); /* 26 leading bits of u */ + u2 = u - u1; + + r = ((v-c*u1)-c*u2)/(u+c*v); + + /* Polynomial approximation to atan(r) */ + + s = r*r; + q2 = q2 + r - r*(s * (0.33333333333224095522 - s*(0.19999918038989143496))); + } + else if (vbyu < 1.e-8) + { /* v/u is small enough that atan(v/u) = v/u */ + q1 = 0.0; + q2 = vbyu; + } + else /* vbyu <= 0.0625 */ + { + /* Small values of v/u. Use a series expansion + computed carefully to minimise cancellation */ + + GET_BITS_DP64(u, ui); + PUT_BITS_DP64(0xffffffff00000000 & ui, u1); + GET_BITS_DP64(vbyu, ui); + PUT_BITS_DP64(0xffffffff00000000 & ui, vu1); + u2 = u - u1; + vu2 = vbyu - vu1; + + q1 = 0.0; + s = vbyu*vbyu; + q2 = vbyu + + ((((v - u1*vu1) - u2*vu1) - u*vu2)/u - + (vbyu*s*(0.33333333333333170500 - + s*(0.19999999999393223405 - + s*(0.14285713561807169030 - + s*(0.11110736283514525407 - + s*(0.90029810285449784439E-01))))))); + } + + /* Tidy-up according to which quadrant the arguments lie in */ + + if (swap_vu) {q1 = piby2_head - q1; q2 = piby2_tail - q2;} + if (xneg) {q1 = pi_head - q1; q2 = pi_tail - q2;} + q1 = q1 + q2; + + if (yneg) q1 = - q1; + + return q1; +} + +weak_alias (__atan2, atan2)
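Note on the quadrant fix-up above: the result is carried as a head/tail pair (q1, q2), and pi and pi/2 are tabulated as pi_head/pi_tail and piby2_head/piby2_tail, so each subtraction stays accurate in the head while the low-order error is absorbed in the tail; only the final q1 + q2 commits a rounding. A minimal standalone sketch of that idea, assuming only C99 (the helper name is illustrative, not part of the library):

#include <stdio.h>

/* pi split into a head carrying only the leading bits plus a tail;
   head + tail is much closer to pi than any single double, values
   as tabulated above. */
static const double pi_head = 3.1415926218032836e+00;
static const double pi_tail = 3.1786509547056392e-08;

/* Subtract a head/tail result (q1,q2) from pi with one final
   rounding, mirroring the xneg fix-up in atan2 above. */
static double pi_minus_split(double q1, double q2)
{
    return (pi_head - q1) + (pi_tail - q2);
}

int main(void)
{
    /* atan(1) = pi/4, carried as an illustrative head/tail pair */
    double q1 = 7.8539816339744828e-01, q2 = 3.0616169978683830e-17;
    printf("%.17g\n", pi_minus_split(q1, q2)); /* ~ 3*pi/4 */
    return 0;
}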
diff --git a/src/atan2f.c b/src/atan2f.c new file mode 100644 index 0000000..9b53c6f --- /dev/null +++ b/src/atan2f.c
@@ -0,0 +1,500 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_VALF_WITH_FLAGS +#define USE_NAN_WITH_FLAGS +#define USE_SCALEDOUBLE_1 +#define USE_SCALEDOWNDOUBLE +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_VALF_WITH_FLAGS +#undef USE_NAN_WITH_FLAGS +#undef USE_SCALEDOUBLE_1 +#undef USE_SCALEDOWNDOUBLE +#undef USE_HANDLE_ERRORF + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range arguments + (only used when _LIB_VERSION is _SVID_) */ +static inline float retval_errno_edom(float x, float y) +{ + struct exception exc; + exc.arg1 = (double)x; + exc.arg2 = (double)y; + exc.type = DOMAIN; + exc.name = (char *)"atan2f"; + exc.retval = HUGE; + if (!matherr(&exc)) + { + (void)fputs("atan2f: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#ifdef WINDOWS +#pragma function(atan2f) +#endif + +float FN_PROTOTYPE(atan2f)(float fy, float fx) +{ + /* Array atan_jby256 contains precomputed values of atan(j/256), + for j = 16, 17, ..., 256. 
*/ + + static const double atan_jby256[ 241] = { + 6.24188099959573430842e-02, /* 0x3faff55bb72cfde9 */ + 6.63088949198234745008e-02, /* 0x3fb0f99ea71d52a6 */ + 7.01969710718705064423e-02, /* 0x3fb1f86dbf082d58 */ + 7.40829225490337306415e-02, /* 0x3fb2f719318a4a9a */ + 7.79666338315423007588e-02, /* 0x3fb3f59f0e7c559d */ + 8.18479898030765457007e-02, /* 0x3fb4f3fd677292fb */ + 8.57268757707448092464e-02, /* 0x3fb5f2324fd2d7b2 */ + 8.96031774848717321724e-02, /* 0x3fb6f03bdcea4b0c */ + 9.34767811585894559112e-02, /* 0x3fb7ee182602f10e */ + 9.73475734872236708739e-02, /* 0x3fb8ebc54478fb28 */ + 1.01215441667466668485e-01, /* 0x3fb9e94153cfdcf1 */ + 1.05080273416329528224e-01, /* 0x3fbae68a71c722b8 */ + 1.08941956989865793015e-01, /* 0x3fbbe39ebe6f07c3 */ + 1.12800381201659388752e-01, /* 0x3fbce07c5c3cca32 */ + 1.16655435441069349478e-01, /* 0x3fbddd21701eba6e */ + 1.20507009691224548087e-01, /* 0x3fbed98c2190043a */ + 1.24354994546761424279e-01, /* 0x3fbfd5ba9aac2f6d */ + 1.28199281231298117811e-01, /* 0x3fc068d584212b3d */ + 1.32039761614638734288e-01, /* 0x3fc0e6adccf40881 */ + 1.35876328229701304195e-01, /* 0x3fc1646541060850 */ + 1.39708874289163620386e-01, /* 0x3fc1e1fafb043726 */ + 1.43537293701821222491e-01, /* 0x3fc25f6e171a535c */ + 1.47361481088651630200e-01, /* 0x3fc2dcbdb2fba1ff */ + 1.51181331798580037562e-01, /* 0x3fc359e8edeb99a3 */ + 1.54996741923940972718e-01, /* 0x3fc3d6eee8c6626c */ + 1.58807608315631065832e-01, /* 0x3fc453cec6092a9e */ + 1.62613828597948567589e-01, /* 0x3fc4d087a9da4f17 */ + 1.66415301183114927586e-01, /* 0x3fc54d18ba11570a */ + 1.70211925285474380276e-01, /* 0x3fc5c9811e3ec269 */ + 1.74003600935367680469e-01, /* 0x3fc645bfffb3aa73 */ + 1.77790228992676047071e-01, /* 0x3fc6c1d4898933d8 */ + 1.81571711160032150945e-01, /* 0x3fc73dbde8a7d201 */ + 1.85347949995694760705e-01, /* 0x3fc7b97b4bce5b02 */ + 1.89118848926083965578e-01, /* 0x3fc8350be398ebc7 */ + 1.92884312257974643856e-01, /* 0x3fc8b06ee2879c28 */ + 1.96644245190344985064e-01, /* 0x3fc92ba37d050271 */ + 2.00398553825878511514e-01, /* 0x3fc9a6a8e96c8626 */ + 2.04147145182116990236e-01, /* 0x3fca217e601081a5 */ + 2.07889927202262986272e-01, /* 0x3fca9c231b403279 */ + 2.11626808765629753628e-01, /* 0x3fcb1696574d780b */ + 2.15357699697738047551e-01, /* 0x3fcb90d7529260a2 */ + 2.19082510780057748701e-01, /* 0x3fcc0ae54d768466 */ + 2.22801153759394493514e-01, /* 0x3fcc84bf8a742e6d */ + 2.26513541356919617664e-01, /* 0x3fccfe654e1d5395 */ + 2.30219587276843717927e-01, /* 0x3fcd77d5df205736 */ + 2.33919206214733416127e-01, /* 0x3fcdf110864c9d9d */ + 2.37612313865471241892e-01, /* 0x3fce6a148e96ec4d */ + 2.41298826930858800743e-01, /* 0x3fcee2e1451d980c */ + 2.44978663126864143473e-01, /* 0x3fcf5b75f92c80dd */ + 2.48651741190513253521e-01, /* 0x3fcfd3d1fc40dbe4 */ + 2.52317980886427151166e-01, /* 0x3fd025fa510665b5 */ + 2.55977303013005474952e-01, /* 0x3fd061eea03d6290 */ + 2.59629629408257511791e-01, /* 0x3fd09dc597d86362 */ + 2.63274882955282396590e-01, /* 0x3fd0d97ee509acb3 */ + 2.66912987587400396539e-01, /* 0x3fd1151a362431c9 */ + 2.70543868292936529052e-01, /* 0x3fd150973a9ce546 */ + 2.74167451119658789338e-01, /* 0x3fd18bf5a30bf178 */ + 2.77783663178873208022e-01, /* 0x3fd1c735212dd883 */ + 2.81392432649178403370e-01, /* 0x3fd2025567e47c95 */ + 2.84993688779881237938e-01, /* 0x3fd23d562b381041 */ + 2.88587361894077354396e-01, /* 0x3fd278372057ef45 */ + 2.92173383391398755471e-01, /* 0x3fd2b2f7fd9b5fe2 */ + 2.95751685750431536626e-01, /* 0x3fd2ed987a823cfe */ + 2.99322202530807379706e-01, /* 
0x3fd328184fb58951 */ + 3.02884868374971361060e-01, /* 0x3fd362773707ebcb */ + 3.06439619009630070945e-01, /* 0x3fd39cb4eb76157b */ + 3.09986391246883430384e-01, /* 0x3fd3d6d129271134 */ + 3.13525122985043869228e-01, /* 0x3fd410cbad6c7d32 */ + 3.17055753209146973237e-01, /* 0x3fd44aa436c2af09 */ + 3.20578221991156986359e-01, /* 0x3fd4845a84d0c21b */ + 3.24092470489871664618e-01, /* 0x3fd4bdee586890e6 */ + 3.27598440950530811477e-01, /* 0x3fd4f75f73869978 */ + 3.31096076704132047386e-01, /* 0x3fd530ad9951cd49 */ + 3.34585322166458920545e-01, /* 0x3fd569d88e1b4cd7 */ + 3.38066122836825466713e-01, /* 0x3fd5a2e0175e0f4e */ + 3.41538425296541714449e-01, /* 0x3fd5dbc3fbbe768d */ + 3.45002177207105076295e-01, /* 0x3fd614840309cfe1 */ + 3.48457327308122011278e-01, /* 0x3fd64d1ff635c1c5 */ + 3.51903825414964732676e-01, /* 0x3fd685979f5fa6fd */ + 3.55341622416168290144e-01, /* 0x3fd6bdeac9cbd76c */ + 3.58770670270572189509e-01, /* 0x3fd6f61941e4def0 */ + 3.62190922004212156882e-01, /* 0x3fd72e22d53aa2a9 */ + 3.65602331706966821034e-01, /* 0x3fd7660752817501 */ + 3.69004854528964421068e-01, /* 0x3fd79dc6899118d1 */ + 3.72398446676754202311e-01, /* 0x3fd7d5604b63b3f7 */ + 3.75783065409248884237e-01, /* 0x3fd80cd46a14b1d0 */ + 3.79158669033441808605e-01, /* 0x3fd84422b8df95d7 */ + 3.82525216899905096124e-01, /* 0x3fd87b4b0c1ebedb */ + 3.85882669398073752109e-01, /* 0x3fd8b24d394a1b25 */ + 3.89230987951320717144e-01, /* 0x3fd8e92916f5cde8 */ + 3.92570135011828580396e-01, /* 0x3fd91fde7cd0c662 */ + 3.95900074055262896078e-01, /* 0x3fd9566d43a34907 */ + 3.99220769575252543149e-01, /* 0x3fd98cd5454d6b18 */ + 4.02532187077682512832e-01, /* 0x3fd9c3165cc58107 */ + 4.05834293074804064450e-01, /* 0x3fd9f93066168001 */ + 4.09127055079168300278e-01, /* 0x3fda2f233e5e530b */ + 4.12410441597387267265e-01, /* 0x3fda64eec3cc23fc */ + 4.15684422123729413467e-01, /* 0x3fda9a92d59e98cf */ + 4.18948967133552840902e-01, /* 0x3fdad00f5422058b */ + 4.22204048076583571270e-01, /* 0x3fdb056420ae9343 */ + 4.25449637370042266227e-01, /* 0x3fdb3a911da65c6c */ + 4.28685708391625730496e-01, /* 0x3fdb6f962e737efb */ + 4.31912235472348193799e-01, /* 0x3fdba473378624a5 */ + 4.35129193889246812521e-01, /* 0x3fdbd9281e528191 */ + 4.38336559857957774877e-01, /* 0x3fdc0db4c94ec9ef */ + 4.41534310525166673322e-01, /* 0x3fdc42191ff11eb6 */ + 4.44722423960939305942e-01, /* 0x3fdc76550aad71f8 */ + 4.47900879150937292206e-01, /* 0x3fdcaa6872f3631b */ + 4.51069655988523443568e-01, /* 0x3fdcde53432c1350 */ + 4.54228735266762495559e-01, /* 0x3fdd121566b7f2ad */ + 4.57378098670320809571e-01, /* 0x3fdd45aec9ec862b */ + 4.60517728767271039558e-01, /* 0x3fdd791f5a1226f4 */ + 4.63647609000806093515e-01, /* 0x3fddac670561bb4f */ + 4.66767723680866497560e-01, /* 0x3fdddf85bb026974 */ + 4.69878057975686880265e-01, /* 0x3fde127b6b0744af */ + 4.72978597903265574054e-01, /* 0x3fde4548066cf51a */ + 4.76069330322761219421e-01, /* 0x3fde77eb7f175a34 */ + 4.79150242925822533735e-01, /* 0x3fdeaa65c7cf28c4 */ + 4.82221324227853687105e-01, /* 0x3fdedcb6d43f8434 */ + 4.85282563559221225002e-01, /* 0x3fdf0ede98f393cf */ + 4.88333951056405479729e-01, /* 0x3fdf40dd0b541417 */ + 4.91375477653101910835e-01, /* 0x3fdf72b221a4e495 */ + 4.94407135071275316562e-01, /* 0x3fdfa45dd3029258 */ + 4.97428915812172245392e-01, /* 0x3fdfd5e0175fdf83 */ + 5.00440813147294050189e-01, /* 0x3fe0039c73c1a40b */ + 5.03442821109336358099e-01, /* 0x3fe01c341e82422d */ + 5.06434934483096732549e-01, /* 0x3fe034b709250488 */ + 5.09417148796356245022e-01, /* 0x3fe04d25314342e5 */ + 
5.12389460310737621107e-01, /* 0x3fe0657e94db30cf */ + 5.15351866012543347040e-01, /* 0x3fe07dc3324e9b38 */ + 5.18304363603577900044e-01, /* 0x3fe095f30861a58f */ + 5.21246951491958210312e-01, /* 0x3fe0ae0e1639866c */ + 5.24179628782913242802e-01, /* 0x3fe0c6145b5b43da */ + 5.27102395269579471204e-01, /* 0x3fe0de05d7aa6f7c */ + 5.30015251423793132268e-01, /* 0x3fe0f5e28b67e295 */ + 5.32918198386882147055e-01, /* 0x3fe10daa77307a0d */ + 5.35811237960463593311e-01, /* 0x3fe1255d9bfbd2a8 */ + 5.38694372597246617929e-01, /* 0x3fe13cfbfb1b056e */ + 5.41567605391844897333e-01, /* 0x3fe1548596376469 */ + 5.44430940071603086672e-01, /* 0x3fe16bfa6f5137e1 */ + 5.47284380987436924748e-01, /* 0x3fe1835a88be7c13 */ + 5.50127933104692989907e-01, /* 0x3fe19aa5e5299f99 */ + 5.52961601994028217888e-01, /* 0x3fe1b1dc87904284 */ + 5.55785393822313511514e-01, /* 0x3fe1c8fe7341f64f */ + 5.58599315343562330405e-01, /* 0x3fe1e00babdefeb3 */ + 5.61403373889889367732e-01, /* 0x3fe1f7043557138a */ + 5.64197577362497537656e-01, /* 0x3fe20de813e823b1 */ + 5.66981934222700489912e-01, /* 0x3fe224b74c1d192a */ + 5.69756453482978431069e-01, /* 0x3fe23b71e2cc9e6a */ + 5.72521144698072359525e-01, /* 0x3fe25217dd17e501 */ + 5.75276017956117824426e-01, /* 0x3fe268a940696da6 */ + 5.78021083869819540801e-01, /* 0x3fe27f261273d1b3 */ + 5.80756353567670302596e-01, /* 0x3fe2958e59308e30 */ + 5.83481838685214859730e-01, /* 0x3fe2abe21aded073 */ + 5.86197551356360535557e-01, /* 0x3fe2c2215e024465 */ + 5.88903504204738026395e-01, /* 0x3fe2d84c2961e48b */ + 5.91599710335111383941e-01, /* 0x3fe2ee628406cbca */ + 5.94286183324841177367e-01, /* 0x3fe30464753b090a */ + 5.96962937215401501234e-01, /* 0x3fe31a52048874be */ + 5.99629986503951384336e-01, /* 0x3fe3302b39b78856 */ + 6.02287346134964152178e-01, /* 0x3fe345f01cce37bb */ + 6.04935031491913965951e-01, /* 0x3fe35ba0b60eccce */ + 6.07573058389022313541e-01, /* 0x3fe3713d0df6c503 */ + 6.10201443063065118722e-01, /* 0x3fe386c52d3db11e */ + 6.12820202165241245673e-01, /* 0x3fe39c391cd41719 */ + 6.15429352753104952356e-01, /* 0x3fe3b198e5e2564a */ + 6.18028912282561737612e-01, /* 0x3fe3c6e491c78dc4 */ + 6.20618898599929469384e-01, /* 0x3fe3dc1c2a188504 */ + 6.23199329934065904268e-01, /* 0x3fe3f13fb89e96f4 */ + 6.25770224888563042498e-01, /* 0x3fe4064f47569f48 */ + 6.28331602434009650615e-01, /* 0x3fe41b4ae06fea41 */ + 6.30883481900321840818e-01, /* 0x3fe430328e4b26d5 */ + 6.33425882969144482537e-01, /* 0x3fe445065b795b55 */ + 6.35958825666321447834e-01, /* 0x3fe459c652badc7f */ + 6.38482330354437466191e-01, /* 0x3fe46e727efe4715 */ + 6.40996417725432032775e-01, /* 0x3fe4830aeb5f7bfd */ + 6.43501108793284370968e-01, /* 0x3fe4978fa3269ee1 */ + 6.45996424886771558604e-01, /* 0x3fe4ac00b1c71762 */ + 6.48482387642300484032e-01, /* 0x3fe4c05e22de94e4 */ + 6.50959018996812410762e-01, /* 0x3fe4d4a8023414e8 */ + 6.53426341180761927063e-01, /* 0x3fe4e8de5bb6ec04 */ + 6.55884376711170835605e-01, /* 0x3fe4fd013b7dd17e */ + 6.58333148384755983962e-01, /* 0x3fe51110adc5ed81 */ + 6.60772679271132590273e-01, /* 0x3fe5250cbef1e9fa */ + 6.63202992706093175102e-01, /* 0x3fe538f57b89061e */ + 6.65624112284960989250e-01, /* 0x3fe54ccaf0362c8f */ + 6.68036061856020157990e-01, /* 0x3fe5608d29c70c34 */ + 6.70438865514021320458e-01, /* 0x3fe5743c352b33b9 */ + 6.72832547593763097282e-01, /* 0x3fe587d81f732fba */ + 6.75217132663749830535e-01, /* 0x3fe59b60f5cfab9d */ + 6.77592645519925151909e-01, /* 0x3fe5aed6c5909517 */ + 6.79959111179481823228e-01, /* 0x3fe5c2399c244260 */ + 6.82316554874748071313e-01, /* 
0x3fe5d58987169b18 */ + 6.84665002047148862907e-01, /* 0x3fe5e8c6941043cf */ + 6.87004478341244895212e-01, /* 0x3fe5fbf0d0d5cc49 */ + 6.89335009598845749323e-01, /* 0x3fe60f084b46e05e */ + 6.91656621853199760075e-01, /* 0x3fe6220d115d7b8d */ + 6.93969341323259825138e-01, /* 0x3fe634ff312d1f3b */ + 6.96273194408023488045e-01, /* 0x3fe647deb8e20b8f */ + 6.98568207680949848637e-01, /* 0x3fe65aabb6c07b02 */ + 7.00854407884450081312e-01, /* 0x3fe66d663923e086 */ + 7.03131821924453670469e-01, /* 0x3fe6800e4e7e2857 */ + 7.05400476865049030906e-01, /* 0x3fe692a40556fb6a */ + 7.07660399923197958039e-01, /* 0x3fe6a5276c4b0575 */ + 7.09911618463524796141e-01, /* 0x3fe6b798920b3d98 */ + 7.12154159993178659249e-01, /* 0x3fe6c9f7855c3198 */ + 7.14388052156768926793e-01, /* 0x3fe6dc44551553ae */ + 7.16613322731374569052e-01, /* 0x3fe6ee7f10204aef */ + 7.18829999621624415873e-01, /* 0x3fe700a7c5784633 */ + 7.21038110854851588272e-01, /* 0x3fe712be84295198 */ + 7.23237684576317874097e-01, /* 0x3fe724c35b4fae7b */ + 7.25428749044510712274e-01, /* 0x3fe736b65a172dff */ + 7.27611332626510676214e-01, /* 0x3fe748978fba8e0f */ + 7.29785463793429123314e-01, /* 0x3fe75a670b82d8d8 */ + 7.31951171115916565668e-01, /* 0x3fe76c24dcc6c6c0 */ + 7.34108483259739652560e-01, /* 0x3fe77dd112ea22c7 */ + 7.36257428981428097003e-01, /* 0x3fe78f6bbd5d315e */ + 7.38398037123989547936e-01, /* 0x3fe7a0f4eb9c19a2 */ + 7.40530336612692630105e-01, /* 0x3fe7b26cad2e50fd */ + 7.42654356450917929600e-01, /* 0x3fe7c3d311a6092b */ + 7.44770125716075148681e-01, /* 0x3fe7d528289fa093 */ + 7.46877673555587429099e-01, /* 0x3fe7e66c01c114fd */ + 7.48977029182941400620e-01, /* 0x3fe7f79eacb97898 */ + 7.51068221873802288613e-01, /* 0x3fe808c03940694a */ + 7.53151280962194302759e-01, /* 0x3fe819d0b7158a4c */ + 7.55226235836744863583e-01, /* 0x3fe82ad036000005 */ + 7.57293115936992444759e-01, /* 0x3fe83bbec5cdee22 */ + 7.59351950749757920178e-01, /* 0x3fe84c9c7653f7ea */ + 7.61402769805578416573e-01, /* 0x3fe85d69576cc2c5 */ + 7.63445602675201784315e-01, /* 0x3fe86e2578f87ae5 */ + 7.65480478966144461950e-01, /* 0x3fe87ed0eadc5a2a */ + 7.67507428319308182552e-01, /* 0x3fe88f6bbd023118 */ + 7.69526480405658186434e-01, /* 0x3fe89ff5ff57f1f7 */ + 7.71537664922959498526e-01, /* 0x3fe8b06fc1cf3dfe */ + 7.73541011592573490852e-01, /* 0x3fe8c0d9145cf49d */ + 7.75536550156311621507e-01, /* 0x3fe8d13206f8c4ca */ + 7.77524310373347682379e-01, /* 0x3fe8e17aa99cc05d */ + 7.79504322017186335181e-01, /* 0x3fe8f1b30c44f167 */ + 7.81476614872688268854e-01, /* 0x3fe901db3eeef187 */ + 7.83441218733151756304e-01, /* 0x3fe911f35199833b */ + 7.85398163397448278999e-01}; /* 0x3fe921fb54442d18 */ + + /* Some constants. */ + + static double pi = 3.1415926535897932e+00, /* 0x400921fb54442d18 */ + piby2 = 1.5707963267948966e+00, /* 0x3ff921fb54442d18 */ + piby4 = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */ + three_piby4 = 2.3561944901923449e+00; /* 0x4002d97c7f3321d2 */ + + double u, v, vbyu, q, s, uu, r; + unsigned int swap_vu, index, xzero, yzero, xnan, ynan, xinf, yinf; + int xexp, yexp, diffexp; + + double x = fx; + double y = fy; + + /* Find properties of arguments x and y. 
*/
+
+  unsigned long long ux, aux, xneg, uy, auy, yneg;
+
+  GET_BITS_DP64(x, ux);
+  GET_BITS_DP64(y, uy);
+  aux = ux & ~SIGNBIT_DP64;
+  auy = uy & ~SIGNBIT_DP64;
+  xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+  yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+  xneg = ux & SIGNBIT_DP64;
+  yneg = uy & SIGNBIT_DP64;
+  xzero = (aux == 0);
+  yzero = (auy == 0);
+  xnan = (aux > PINFBITPATT_DP64);
+  ynan = (auy > PINFBITPATT_DP64);
+  xinf = (aux == PINFBITPATT_DP64);
+  yinf = (auy == PINFBITPATT_DP64);
+
+  diffexp = yexp - xexp;
+
+  /* Special cases */
+
+  if (xnan)
+#ifdef WINDOWS
+    {
+      unsigned int ufx;
+      GET_BITS_SP32(fx, ufx);
+      return handle_errorf("atan2f", ufx|0x00400000, _DOMAIN, 0, EDOM, fx, fy);
+    }
+#else
+    return fx + fx; /* Raise invalid if it's a signalling NaN */
+#endif
+  else if (ynan)
+#ifdef WINDOWS
+    {
+      unsigned int ufy;
+      GET_BITS_SP32(fy, ufy);
+      return handle_errorf("atan2f", ufy|0x00400000, _DOMAIN, 0, EDOM, fx, fy);
+    }
+#else
+    return (float)(y + y); /* Raise invalid if it's a signalling NaN */
+#endif
+  else if (yzero)
+    { /* Zero y gives +-0 for positive x
+         and +-pi for negative x */
+#ifndef WINDOWS
+      if ((_LIB_VERSION == _SVID_) && xzero)
+        /* Sigh - _SVID_ defines atan2(0,0) as a domain error */
+        return retval_errno_edom(x, y);
+      else
+#endif
+      if (xneg)
+        {
+          if (yneg) return valf_with_flags((float)-pi, AMD_F_INEXACT);
+          else return valf_with_flags((float)pi, AMD_F_INEXACT);
+        }
+      else return (float)y;
+    }
+  else if (xzero)
+    { /* Zero x gives +-pi/2
+         depending on sign of y */
+      if (yneg) return valf_with_flags((float)-piby2, AMD_F_INEXACT);
+      else return valf_with_flags((float)piby2, AMD_F_INEXACT);
+    }
+
+  if (diffexp > 26)
+    { /* abs(y)/abs(x) > 2^26 => arctan(x/y)
+         is insignificant compared to piby2 */
+      if (yneg) return valf_with_flags((float)-piby2, AMD_F_INEXACT);
+      else return valf_with_flags((float)piby2, AMD_F_INEXACT);
+    }
+  else if (diffexp < -13 && (!xneg))
+    { /* x positive and dominant over y by a factor of 2^13.
+         In this case atan(y/x) is y/x to machine accuracy. */
+
+      if (diffexp < -150) /* Result underflows */
+        {
+          if (yneg)
+            return valf_with_flags(-0.0F, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+          else
+            return valf_with_flags(0.0F, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+        }
+      else
+        {
+          if (diffexp < -126)
+            {
+              /* Result will likely be denormalized */
+              y = scaleDouble_1(y, 100);
+              y /= x;
+              /* Now y is 2^100 times the true result. Scale it back down. */
+              GET_BITS_DP64(y, uy);
+              scaleDownDouble(uy, 100, &uy);
+              PUT_BITS_DP64(uy, y);
+              if ((uy & EXPBITS_DP64) == 0)
+                return valf_with_flags((float)y, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+              else
+                return (float)y;
+            }
+          else
+            return (float)(y / x);
+        }
+    }
+  else if (diffexp < -26 && xneg)
+    { /* abs(x)/abs(y) > 2^26 and x < 0 => arctan(y/x)
+         is insignificant compared to pi */
+      if (yneg) return valf_with_flags((float)-pi, AMD_F_INEXACT);
+      else return valf_with_flags((float)pi, AMD_F_INEXACT);
+    }
+  else if (yinf && xinf)
+    { /* If abs(x) and abs(y) are both infinity
+         return +-pi/4 or +-3pi/4 according to
+         signs. */
+      if (xneg)
+        {
+          if (yneg) return valf_with_flags((float)-three_piby4, AMD_F_INEXACT);
+          else return valf_with_flags((float)three_piby4, AMD_F_INEXACT);
+        }
+      else
+        {
+          if (yneg) return valf_with_flags((float)-piby4, AMD_F_INEXACT);
+          else return valf_with_flags((float)piby4, AMD_F_INEXACT);
+        }
+    }
+
+  /* General case: take absolute values of arguments */
+
+  u = x; v = y;
+  if (xneg) u = -x;
+  if (yneg) v = -y;
+
+  /* Swap u and v if necessary to obtain 0 < v < u.
Compute v/u. */ + + swap_vu = (u < v); + if (swap_vu) { uu = u; u = v; v = uu; } + vbyu = v/u; + + if (vbyu > 0.0625) + { /* General values of v/u. Use a look-up + table and series expansion. */ + + index = (int)(256*vbyu + 0.5); + r = (256*v-index*u)/(256*u+index*v); + + /* Polynomial approximation to atan(vbyu) */ + + s = r*r; + q = atan_jby256[index-16] + r - r*s*0.33333333333224095522; + } + else if (vbyu < 1.e-4) + { /* v/u is small enough that atan(v/u) = v/u */ + q = vbyu; + } + else /* vbyu <= 0.0625 */ + { + /* Small values of v/u. Use a series expansion */ + + s = vbyu*vbyu; + q = vbyu - + vbyu*s*(0.33333333333333170500 - + s*(0.19999999999393223405 - + s*0.14285713561807169030)); + } + + /* Tidy-up according to which quadrant the arguments lie in */ + + if (swap_vu) {q = piby2 - q;} + if (xneg) {q = pi - q;} + if (yneg) q = - q; + return (float)q; +} + +weak_alias (__atan2f, atan2f)
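The reduction r = (256*v - index*u)/(256*u + index*v) above follows from the subtraction formula atan(a) - atan(b) = atan((a - b)/(1 + a*b)) with a = v/u and b = j/256; multiplying the inner numerator and denominator by 256*u gives the division-free-table form used. A rough sketch of the same identity, with libm's atan standing in for the atan_jby256 table (all names here are illustrative):

#include <math.h>
#include <stdio.h>

/* atan(v/u) = atan(j/256) + atan((256v - j*u)/(256u + j*v)),
   where j = round(256*v/u), so the second atan sees a tiny argument
   that a very short polynomial can handle. */
static double atan_by_table(double v, double u)   /* 0 < v <= u */
{
    int j = (int)(256.0 * (v / u) + 0.5);
    double r = (256.0 * v - j * u) / (256.0 * u + j * v);
    return atan(j / 256.0) + atan(r);   /* library tabulates atan(j/256) */
}

int main(void)
{
    printf("%.17g\n", atan_by_table(3.0, 7.0));
    printf("%.17g\n", atan2(3.0, 7.0));   /* should agree closely */
    return 0;
}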
diff --git a/src/atanf.c b/src/atanf.c new file mode 100644 index 0000000..567dd87 --- /dev/null +++ b/src/atanf.c
@@ -0,0 +1,170 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_VALF_WITH_FLAGS +#define USE_NAN_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_VALF_WITH_FLAGS +#undef USE_NAN_WITH_FLAGS +#undef USE_HANDLE_ERRORF + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range argument */ +static inline float retval_errno_edom(float x) +{ + struct exception exc; + exc.arg1 = (float)x; + exc.arg2 = (float)x; + exc.name = (char *)"atanf"; + exc.type = DOMAIN; + if (_LIB_VERSION == _SVID_) + exc.retval = HUGE; + else + exc.retval = nan_with_flags(AMD_F_INVALID); + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("atanf: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#ifdef WINDOWS +#pragma function(atanf) +#endif + +float FN_PROTOTYPE(atanf)(float fx) +{ + + /* Some constants and split constants. */ + + static double piby2 = 1.5707963267948966e+00; /* 0x3ff921fb54442d18 */ + + double c, v, s, q, z; + unsigned int xnan; + + double x = fx; + + /* Find properties of argument fx. */ + + unsigned long long ux, aux, xneg; + + GET_BITS_DP64(x, ux); + aux = ux & ~SIGNBIT_DP64; + xneg = ux & SIGNBIT_DP64; + + v = x; + if (xneg) v = -x; + + /* Argument reduction to range [-7/16,7/16] */ + + if (aux < 0x3ec0000000000000) /* v < 2.0^(-19) */ + { + /* x is a good approximation to atan(x) */ + if (aux == 0x0000000000000000) + return fx; + else + return valf_with_flags(fx, AMD_F_INEXACT); + } + else if (aux < 0x3fdc000000000000) /* v < 7./16. */ + { + x = v; + c = 0.0; + } + else if (aux < 0x3fe6000000000000) /* v < 11./16. */ + { + x = (2.0*v-1.0)/(2.0+v); + /* c = arctan(0.5) */ + c = 4.63647609000806093515e-01; /* 0x3fddac670561bb4f */ + } + else if (aux < 0x3ff3000000000000) /* v < 19./16. */ + { + x = (v-1.0)/(1.0+v); + /* c = arctan(1.) */ + c = 7.85398163397448278999e-01; /* 0x3fe921fb54442d18 */ + } + else if (aux < 0x4003800000000000) /* v < 39./16. 
*/ + { + x = (v-1.5)/(1.0+1.5*v); + /* c = arctan(1.5) */ + c = 9.82793723247329054082e-01; /* 0x3fef730bd281f69b */ + } + else + { + + xnan = (aux > PINFBITPATT_DP64); + + if (xnan) + { + /* x is NaN */ +#ifdef WINDOWS + unsigned int uhx; + GET_BITS_SP32(fx, uhx); + return handle_errorf("atanf", uhx|0x00400000, _DOMAIN, + 0, EDOM, fx, 0.0F); +#else + return x + x; /* Raise invalid if it's a signalling NaN */ +#endif + } + else if (aux > 0x4190000000000000) + { /* abs(x) > 2^26 => arctan(1/x) is + insignificant compared to piby2 */ + if (xneg) + return valf_with_flags((float)-piby2, AMD_F_INEXACT); + else + return valf_with_flags((float)piby2, AMD_F_INEXACT); + } + + x = -1.0/v; + /* c = arctan(infinity) */ + c = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */ + } + + /* Core approximation: Remez(2,2) on [-7/16,7/16] */ + + s = x*x; + q = x*s* + (0.296528598819239217902158651186e0 + + (0.192324546402108583211697690500e0 + + 0.470677934286149214138357545549e-2*s)*s)/ + (0.889585796862432286486651434570e0 + + (0.111072499995399550138837673349e1 + + 0.299309699959659728404442796915e0*s)*s); + + z = c - (q - x); + + if (xneg) z = -z; + return (float)z; +} + +weak_alias (__atanf, atanf)
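The cascade of thresholds above (7/16, 11/16, 19/16, 39/16) implements the reduction atan(v) = atan(c) + atan((v - c)/(1 + c*v)) for pivots c = 0, 1/2, 1 and 3/2, plus atan(v) = pi/2 - atan(1/v) beyond 39/16; for c = 1/2 the reduced argument simplifies to (2v - 1)/(2 + v), exactly the expression in the code. A standalone sketch of the same reduction, with libm's atan in place of the tabulated pivots and the Remez core (names are illustrative):

#include <math.h>
#include <stdio.h>

static const double piby2 = 1.5707963267948966e+00;

/* atan(v) = atan(c) + atan((v - c)/(1 + c*v)) for a pivot c with a
   known arctangent; the reduced argument lands in [-7/16, 7/16]. */
static double atan_reduced(double v)   /* assumes v >= 0 */
{
    double c, x;
    if (v < 7.0/16.0)       { c = 0.0; x = v; }
    else if (v < 11.0/16.0) { c = 0.5; x = (2.0*v - 1.0)/(2.0 + v); }
    else if (v < 19.0/16.0) { c = 1.0; x = (v - 1.0)/(1.0 + v); }
    else if (v < 39.0/16.0) { c = 1.5; x = (v - 1.5)/(1.0 + 1.5*v); }
    else return piby2 + atan(-1.0/v);  /* "atan(infinity)" pivot */
    return atan(c) + atan(x);          /* library tabulates atan(c) */
}

int main(void)
{
    for (double v = 0.25; v < 8.0; v *= 2.0)
        printf("%5.2f: %.17g  %.17g\n", v, atan_reduced(v), atan(v));
    return 0;
}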
diff --git a/src/atanh.c b/src/atanh.c new file mode 100644 index 0000000..5815ced --- /dev/null +++ b/src/atanh.c
@@ -0,0 +1,193 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_NAN_WITH_FLAGS +#define USE_VAL_WITH_FLAGS +#define USE_INFINITY_WITH_FLAGS +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_NAN_WITH_FLAGS +#undef USE_VAL_WITH_FLAGS +#undef USE_INFINITY_WITH_FLAGS +#undef USE_HANDLE_ERROR + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range argument */ +static inline double retval_errno_edom(double x, double retval) +{ + struct exception exc; + exc.arg1 = x; + exc.arg2 = x; + exc.type = DOMAIN; + exc.name = (char *)"atanh"; + if (_LIB_VERSION == _SVID_) + exc.retval = -HUGE; + else + exc.retval = retval; + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("atanh: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#undef _FUNCNAME +#define _FUNCNAME "atanh" +double FN_PROTOTYPE(atanh)(double x) +{ + + unsigned long long ux, ax; + double r, absx, t, poly; + + + GET_BITS_DP64(x, ux); + ax = ux & ~SIGNBIT_DP64; + PUT_BITS_DP64(ax, absx); + + if ((ux & EXPBITS_DP64) == EXPBITS_DP64) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_DP64) + { + /* x is NaN */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is infinity; return a NaN */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x,nan_with_flags(AMD_F_INVALID)); +#endif + } + } + else if (ax >= 0x3ff0000000000000) + { + if (ax > 0x3ff0000000000000) + { + /* abs(x) > 1.0; return NaN */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x,nan_with_flags(AMD_F_INVALID)); +#endif + } + else if (ux == 0x3ff0000000000000) + { + /* x = +1.0; return infinity with the same sign as x + and set the divbyzero status flag */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, PINFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x,infinity_with_flags(AMD_F_DIVBYZERO)); +#endif + } + else + { + /* x = -1.0; return infinity with the same sign as x */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, NINFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x,-infinity_with_flags(AMD_F_DIVBYZERO)); +#endif + } + } + + + if (ax < 0x3e30000000000000) + { + if (ax == 0x0000000000000000) + { + /* x is +/-zero. Return the same zero. 
*/ + return x; + } + else + { + /* Arguments smaller than 2^(-28) in magnitude are + approximated by atanh(x) = x, raising inexact flag. */ + return val_with_flags(x, AMD_F_INEXACT); + } + } + else + { + if (ax < 0x3fe0000000000000) + { + /* Arguments up to 0.5 in magnitude are + approximated by a [5,5] minimax polynomial */ + t = x*x; + poly = + (0.47482573589747356373e0 + + (-0.11028356797846341457e1 + + (0.88468142536501647470e0 + + (-0.28180210961780814148e0 + + (0.28728638600548514553e-1 - + 0.10468158892753136958e-3 * t) * t) * t) * t) * t) / + (0.14244772076924206909e1 + + (-0.41631933639693546274e1 + + (0.45414700626084508355e1 + + (-0.22608883748988489342e1 + + (0.49561196555503101989e0 - + 0.35861554370169537512e-1 * t) * t) * t) * t) * t); + return x + x*t*poly; + } + else + { + /* abs(x) >= 0.5 */ + /* Note that + atanh(x) = 0.5 * ln((1+x)/(1-x)) + (see Abramowitz and Stegun 4.6.22). + For greater accuracy we use the variant formula + atanh(x) = log(1 + 2x/(1-x)) = log1p(2x/(1-x)). + */ + r = (2.0 * absx) / (1.0 - absx); + r = 0.5 * FN_PROTOTYPE(log1p)(r); + if (ux & SIGNBIT_DP64) + /* Argument x is negative */ + return -r; + else + return r; + } + } +} + +weak_alias (__atanh, atanh)
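The |x| >= 0.5 branch above rests on rewriting 0.5*ln((1+x)/(1-x)) as 0.5*log1p(2x/(1-x)). For 0.5 <= x < 1 the subtraction 1 - x is exact (Sterbenz), and log1p avoids explicitly forming 1 + 2x/(1-x), so the variant commits fewer roundings than the quotient form. A minimal comparison of the two forms, using the standard C99 atanh only as a reference:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 0.9999999;
    double quotient = 0.5 * log((1.0 + x) / (1.0 - x));
    double variant  = 0.5 * log1p((2.0 * x) / (1.0 - x));
    printf("quotient form: %.17g\n", quotient);
    printf("log1p form   : %.17g\n", variant);
    printf("atanh        : %.17g\n", atanh(x));  /* reference */
    return 0;
}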
diff --git a/src/atanhf.c b/src/atanhf.c new file mode 100644 index 0000000..38692b4 --- /dev/null +++ b/src/atanhf.c
@@ -0,0 +1,194 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#include <stdio.h> + +#define USE_NANF_WITH_FLAGS +#define USE_VALF_WITH_FLAGS +#define USE_INFINITYF_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_NANF_WITH_FLAGS +#undef USE_VALF_WITH_FLAGS +#undef USE_INFINITYF_WITH_FLAGS +#undef USE_HANDLE_ERRORF + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range argument */ +static inline float retval_errno_edom(float x, float retval) +{ + struct exception exc; + exc.arg1 = (double)x; + exc.arg2 = (double)x; + exc.type = DOMAIN; + exc.name = (char *)"atanhf"; + if (_LIB_VERSION == _SVID_) + exc.retval = -HUGE; + else + exc.retval = (double)retval; + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("atanhf: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#undef _FUNCNAME +#define _FUNCNAME "atanhf" +float FN_PROTOTYPE(atanhf)(float x) +{ + + double dx; + unsigned int ux, ax; + double r, t, poly; + + GET_BITS_SP32(x, ux); + ax = ux & ~SIGNBIT_SP32; + + if ((ux & EXPBITS_SP32) == EXPBITS_SP32) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_SP32) + { + /* x is NaN */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN, + 0, EDOM, x, 0.0F); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is infinity; return a NaN */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x,nanf_with_flags(AMD_F_INVALID)); +#endif + } + } + else if (ax >= 0x3f800000) + { + if (ax > 0x3f800000) + { + /* abs(x) > 1.0; return NaN */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x,nanf_with_flags(AMD_F_INVALID)); +#endif + } + else if (ux == 0x3f800000) + { + /* x = +1.0; return infinity with the same sign as x + and set the divbyzero status flag */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, PINFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x,infinityf_with_flags(AMD_F_DIVBYZERO)); +#endif + } + else + { + /* x = -1.0; return infinity with the same sign as x */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x,-infinityf_with_flags(AMD_F_DIVBYZERO)); +#endif + } + } + + if (ax < 0x39000000) + { + if (ax == 0x00000000) + { + /* x is +/-zero. Return the same zero. 
*/ + return x; + } + else + { + /* Arguments smaller than 2^(-13) in magnitude are + approximated by atanhf(x) = x, raising inexact flag. */ + return valf_with_flags(x, AMD_F_INEXACT); + } + } + else + { + dx = x; + if (ax < 0x3f000000) + { + /* Arguments up to 0.5 in magnitude are + approximated by a [2,2] minimax polynomial */ + t = dx*dx; + poly = + (0.39453629046e0 + + (-0.28120347286e0 + + 0.92834212715e-2 * t) * t) / + (0.11836088638e1 + + (-0.15537744551e1 + + 0.45281890445e0 * t) * t); + return (float)(dx + dx*t*poly); + } + else + { + /* abs(x) >= 0.5 */ + /* Note that + atanhf(x) = 0.5 * ln((1+x)/(1-x)) + (see Abramowitz and Stegun 4.6.22). + For greater accuracy we use the variant formula + atanhf(x) = log(1 + 2x/(1-x)) = log1p(2x/(1-x)). + */ + if (ux & SIGNBIT_SP32) + { + /* Argument x is negative */ + r = (-2.0 * dx) / (1.0 + dx); + r = 0.5 * FN_PROTOTYPE(log1p)(r); + return (float)-r; + } + else + { + r = (2.0 * dx) / (1.0 - dx); + r = 0.5 * FN_PROTOTYPE(log1p)(r); + return (float)r; + } + } + } +} + +weak_alias (__atanhf, atanhf)
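As with several single-precision routines in this change, atanhf promotes once to double, does all arithmetic there, and rounds back to float on return; the double format's 29 extra significand bits keep the intermediate roundings far below half a float ulp. A minimal sketch of that promote-compute-demote pattern (a hypothetical helper, not the library's code, and covering only 0 <= x < 1):

#include <math.h>
#include <stdio.h>

/* Promote to double, evaluate, round back to float exactly once. */
static float atanhf_via_double(float x)   /* assumes 0.0f <= x < 1.0f */
{
    double dx = x;
    return (float)(0.5 * log1p((2.0 * dx) / (1.0 - dx)));
}

int main(void)
{
    float x = 0.75f;
    printf("%.9g\n", atanhf_via_double(x));
    printf("%.9g\n", atanhf(x));   /* C99 reference */
    return 0;
}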
diff --git a/src/ceil.c b/src/ceil.c new file mode 100644 index 0000000..94ef21d --- /dev/null +++ b/src/ceil.c
@@ -0,0 +1,104 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#endif
+
+#ifdef WINDOWS
+#pragma function(ceil)
+#endif
+
+double FN_PROTOTYPE(ceil)(double x)
+{
+  double r;
+  long long rexp, xneg;
+  unsigned long long ux, ax, ur, mask;
+
+  GET_BITS_DP64(x, ux);
+  /* ax is |x| */
+  ax = ux & (~SIGNBIT_DP64);
+  /* xneg stores the sign of the input x */
+  xneg = (ux != ax);
+  /* The range is divided into three cases:
+     |x| >= 2^53: x is already integral, NaN or infinity. Return the
+         number itself; a NaN input returns a QNaN, raising an
+         exception if the input is a SNaN.
+     |x| < 1.0: if x is a zero, return it with the appropriate sign.
+         If -1.0 < x < -0.0, return -0.0.
+         If 0.0 < x < 1.0, return 1.0.
+     1.0 <= |x| < 2^53: check the exponent and construct the return
+         value by masking off the fraction bits.
+  */
+  if (ax >= 0x4340000000000000) /* abs(x) >= 2^53 */
+    {
+      /* abs(x) is either NaN, infinity, or >= 2^53 */
+      if (ax > 0x7ff0000000000000)
+        /* x is NaN */
+#ifdef WINDOWS
+        return handle_error("ceil", ux|0x0008000000000000, _DOMAIN, 0,
+                            EDOM, x, 0.0);
+#else
+        return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+      else
+        return x;
+    }
+  else if (ax < 0x3ff0000000000000) /* abs(x) < 1.0 */
+    {
+      if (ax == 0x0000000000000000)
+        /* x is +zero or -zero; return the same zero */
+        return x;
+      else if (xneg) /* x < 0.0; return -0.0 */
+        {
+          PUT_BITS_DP64(0x8000000000000000, r);
+          return r;
+        }
+      else
+        return 1.0;
+    }
+  else
+    {
+      /* Get the unbiased exponent of x; it lies between 0 and 52 */
+      rexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+      /* Mask out the bits of r that we don't want */
+      mask = 1;
+      mask = (mask << (EXPSHIFTBITS_DP64 - rexp)) - 1;
+      /* Keep the sign, exponent and required mantissa bits */
+      ur = (ux & ~mask);
+      PUT_BITS_DP64(ur, r);
+      if (xneg || (ur == ux))
+        return r;
+      else
+        /* We threw some bits away and x was positive */
+        return r + 1.0;
+    }
+
+}
+
+weak_alias (__ceil, ceil)
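The masking step above is the whole algorithm: for 1.0 <= |x| < 2^53 the fractional part of x occupies the low (52 - exponent) mantissa bits, so clearing them truncates toward zero, and adding 1.0 completes the ceiling exactly when bits were discarded from a positive x. A self-contained sketch of the same trick using portable type punning (no NaN/infinity handling; names are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static double ceil_sketch(double x)   /* finite x only */
{
    uint64_t ux, ur, mask;
    memcpy(&ux, &x, sizeof ux);                  /* bit image of x */
    int exp = (int)((ux >> 52) & 0x7ff) - 1023;  /* unbiased exponent */
    if (exp < 0)                                 /* |x| < 1.0 */
        return x > 0.0 ? 1.0 : (x < 0.0 ? -0.0 : x);
    if (exp >= 52)                               /* already integral */
        return x;
    mask = (1ULL << (52 - exp)) - 1;             /* the fraction bits */
    ur = ux & ~mask;                             /* truncate toward 0 */
    double r;
    memcpy(&r, &ur, sizeof r);
    if ((ux >> 63) || ur == ux)                  /* negative, or exact */
        return r;
    return r + 1.0;                              /* bits dropped, x > 0 */
}

int main(void)
{
    printf("%g %g %g %g\n", ceil_sketch(2.3), ceil_sketch(-2.3),
           ceil_sketch(7.0), ceil_sketch(0.5));  /* 3 -2 7 1 */
    return 0;
}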
diff --git a/src/ceilf.c b/src/ceilf.c new file mode 100644 index 0000000..56d0c37 --- /dev/null +++ b/src/ceilf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERRORF
+#endif
+
+#ifdef WINDOWS
+#pragma function(ceilf)
+#endif
+
+float FN_PROTOTYPE(ceilf)(float x)
+{
+  float r;
+  int rexp, xneg;
+  unsigned int ux, ax, ur, mask;
+
+  GET_BITS_SP32(x, ux);
+  /* ax is |x| */
+  ax = ux & (~SIGNBIT_SP32);
+  /* xneg stores the sign of the input x */
+  xneg = (ux != ax);
+  /* The range is divided into three cases:
+     |x| >= 2^24: x is already integral, NaN or infinity. Return the
+         number itself; a NaN input returns a QNaN, raising an
+         exception if the input is a SNaN.
+     |x| < 1.0: if x is a zero, return it with the appropriate sign.
+         If -1.0 < x < -0.0, return -0.0.
+         If 0.0 < x < 1.0, return 1.0.
+     1.0 <= |x| < 2^24: check the exponent and construct the return
+         value by masking off the fraction bits.
+  */
+  if (ax >= 0x4b800000) /* abs(x) >= 2^24 */
+    {
+      /* abs(x) is either NaN, infinity, or >= 2^24 */
+      if (ax > 0x7f800000)
+        /* x is NaN */
+#ifdef WINDOWS
+        return handle_errorf("ceilf", ux, _DOMAIN, 0, EDOM, x, 0.0F);
+#else
+        return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+      else
+        return x;
+    }
+  else if (ax < 0x3f800000) /* abs(x) < 1.0 */
+    {
+      if (ax == 0x00000000)
+        /* x is +zero or -zero; return the same zero */
+        return x;
+      else if (xneg) /* x < 0.0 */
+        return -0.0F;
+      else
+        return 1.0F;
+    }
+  else
+    {
+      /* Get the unbiased exponent of x; it lies between 0 and 23 */
+      rexp = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+      /* Mask out the bits of r that we don't want */
+      mask = (1 << (EXPSHIFTBITS_SP32 - rexp)) - 1;
+      /* Keep the sign, exponent and required mantissa bits */
+      ur = (ux & ~mask);
+      PUT_BITS_SP32(ur, r);
+
+      if (xneg || (ux == ur)) return r;
+      else
+        /* We threw some bits away and x was positive */
+        return r + 1.0F;
+    }
+}
+
+weak_alias (__ceilf, ceilf)
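The 2^24 cutoff (0x4b800000) in ceilf exists because a float carries a 24-bit significand: at and above 2^24 the spacing between consecutive floats is at least 1, so every such value is already an integer. A short check of that boundary (C99, links against libm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float f = 8388607.5f;   /* largest float with a nonzero fraction */
    printf("ceilf(%.1f) = %.1f\n", f, ceilf(f));
    /* At 2^24 the float spacing has grown to 2, so no value with a
       fractional part is representable any more. */
    float t = 16777216.0f;  /* 2^24 */
    printf("spacing at 2^24 = %g\n", nextafterf(t, 1.0e30f) - t);
    return 0;
}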
diff --git a/src/cosh.c b/src/cosh.c new file mode 100644 index 0000000..6f8734b --- /dev/null +++ b/src/cosh.c
@@ -0,0 +1,359 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_INFINITY_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange(double x)
+{
+  struct exception exc;
+  exc.arg1 = x;
+  exc.arg2 = x;
+  exc.type = OVERFLOW;
+  exc.name = (char *)"cosh";
+  if (_LIB_VERSION == _SVID_)
+    {
+      exc.retval = HUGE;
+    }
+  else
+    {
+      exc.retval = infinity_with_flags(AMD_F_OVERFLOW);
+    }
+  if (_LIB_VERSION == _POSIX_)
+    __set_errno(ERANGE);
+  else if (!matherr(&exc))
+    __set_errno(ERANGE);
+  return exc.retval;
+}
+#endif
+
+double FN_PROTOTYPE(cosh)(double x)
+{
+  /*
+    Derived from the sinh subroutine. Note that cosh is an even
+    function, so the sign of x plays no part in the result.
+
+    After dealing with special cases the computation is split into
+    regions as follows:
+
+    abs(x) >= max_cosh_arg:
+    cosh(x) = +Inf with overflow
+
+    abs(x) >= small_threshold:
+    cosh(x) = exp(abs(x))/2 computed using the
+    splitexp and scaleDouble functions as for exp_amd().
+
+    abs(x) < small_threshold:
+    split abs(x) into an integer part y0 and a remainder dy,
+    then combine the tabulated sinh(y0) and cosh(y0) with short
+    series for sinh(dy) and cosh(dy) using the addition formula
+    cosh(y0+dy) = cosh(y0)*cosh(dy) + sinh(y0)*sinh(dy). */
+
+  static const double
+    max_cosh_arg = 7.10475860073943977113e+02,      /* 0x408633ce8fb9f87e */
+    thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+    log2_by_32_lead = 2.16608493356034159660e-02,   /* 0x3f962e42fe000000 */
+    log2_by_32_tail = 5.68948749532545630390e-11,   /* 0x3dcf473de6af278e */
+//  small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+    small_threshold = 20.0;
+    /* (8*BASEDIGITS_DP64*log10of2) => exp(-x) insignificant c.f. exp(x) */
+
+  /* Lead and tail tabulated values of sinh(i) and cosh(i)
+     for i = 0,...,36. The lead part has 26 leading bits.
*/ + + static const double sinh_lead[ 37] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.17520117759704589844e+00, /* 0x3ff2cd9fc0000000 */ + 3.62686038017272949219e+00, /* 0x400d03cf60000000 */ + 1.00178747177124023438e+01, /* 0x40240926e0000000 */ + 2.72899169921875000000e+01, /* 0x403b4a3800000000 */ + 7.42032089233398437500e+01, /* 0x40528d0160000000 */ + 2.01713153839111328125e+02, /* 0x406936d228000000 */ + 5.48316116333007812500e+02, /* 0x4081228768000000 */ + 1.49047882080078125000e+03, /* 0x409749ea50000000 */ + 4.05154187011718750000e+03, /* 0x40afa71570000000 */ + 1.10132326660156250000e+04, /* 0x40c5829dc8000000 */ + 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */ + 8.13773945312500000000e+04, /* 0x40f3de1650000000 */ + 2.21206695312500000000e+05, /* 0x410b00b590000000 */ + 6.01302140625000000000e+05, /* 0x412259ac48000000 */ + 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */ + 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */ + 1.20774762500000000000e+07, /* 0x4167093488000000 */ + 3.28299845000000000000e+07, /* 0x417f4f2208000000 */ + 8.92411500000000000000e+07, /* 0x419546d8f8000000 */ + 2.42582596000000000000e+08, /* 0x41aceb0888000000 */ + 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */ + 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */ + 4.87240166400000000000e+09, /* 0x41f226af30000000 */ + 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */ + 3.60024494080000000000e+10, /* 0x4220c3d390000000 */ + 9.78648043520000000000e+10, /* 0x4236c93268000000 */ + 2.66024116224000000000e+11, /* 0x424ef822f0000000 */ + 7.23128516608000000000e+11, /* 0x42650bba30000000 */ + 1.96566712320000000000e+12, /* 0x427c9aae40000000 */ + 5.34323724288000000000e+12, /* 0x4293704708000000 */ + 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */ + 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */ + 1.07321789251584000000e+14, /* 0x42d866f348000000 */ + 2.91730863685632000000e+14, /* 0x42f0953e28000000 */ + 7.93006722514944000000e+14, /* 0x430689e220000000 */ + 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */ + + static const double sinh_tail[ 37] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.60467555584448807892e-08, /* 0x3e513ae6096a0092 */ + 2.76742892754807136947e-08, /* 0x3e5db70cfb79a640 */ + 2.09697499555224576530e-07, /* 0x3e8c2526b66dc067 */ + 2.04940252448908240062e-07, /* 0x3e8b81b18647f380 */ + 1.65444891522700935932e-06, /* 0x3ebbc1cdd1e1eb08 */ + 3.53116789999998198721e-06, /* 0x3ecd9f201534fb09 */ + 6.94023870987375490695e-06, /* 0x3edd1c064a4e9954 */ + 4.98876893611587449271e-06, /* 0x3ed4eca65d06ea74 */ + 3.19656024605152215752e-05, /* 0x3f00c259bcc0ecc5 */ + 2.08687768377236501204e-04, /* 0x3f2b5a6647cf9016 */ + 4.84668088325403796299e-05, /* 0x3f09691adefb0870 */ + 1.17517985422733832468e-03, /* 0x3f53410fc29cde38 */ + 6.90830086959560562415e-04, /* 0x3f46a31a50b6fb3c */ + 1.45697262451506548420e-03, /* 0x3f57defc71805c40 */ + 2.99859023684906737806e-02, /* 0x3f9eb49fd80e0bab */ + 1.02538800507941396667e-02, /* 0x3f84fffc7bcd5920 */ + 1.26787628407699110022e-01, /* 0x3fc03a93b6c63435 */ + 6.86652479544033744752e-02, /* 0x3fb1940bb255fd1c */ + 4.81593627621056619148e-01, /* 0x3fded26e14260b50 */ + 1.70489513795397629181e+00, /* 0x3ffb47401fc9f2a2 */ + 1.12416073482258713767e+01, /* 0x40267bb3f55634f1 */ + 7.06579578070110514432e+00, /* 0x401c435ff8194ddc */ + 5.91244512999659974639e+01, /* 0x404d8fee052ba63a */ + 1.68921736147050694399e+02, /* 0x40651d7edccde3f6 */ + 2.60692936262073658327e+02, /* 0x40704b1644557d1a */ 
+ 3.62419382134885609048e+02, /* 0x4076a6b5ca0a9dc4 */ + 4.07689930834187271103e+03, /* 0x40afd9cc72249aba */ + 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */ + 2.53720210371943067003e+04, /* 0x40d8c70158ac6363 */ + 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */ + 1.81871712615542812273e+05, /* 0x4106337db36fc718 */ + 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */ + 6.41374032312148716301e+05, /* 0x412392bc108b37cc */ + 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */ + 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */ + 7.63580561355670914054e+06}; /* 0x415d20d76744835c */ + + static const double cosh_lead[ 37] = { + 1.00000000000000000000e+00, /* 0x3ff0000000000000 */ + 1.54308062791824340820e+00, /* 0x3ff8b07550000000 */ + 3.76219564676284790039e+00, /* 0x400e18fa08000000 */ + 1.00676617622375488281e+01, /* 0x402422a490000000 */ + 2.73082327842712402344e+01, /* 0x403b4ee858000000 */ + 7.42099475860595703125e+01, /* 0x40528d6fc8000000 */ + 2.01715633392333984375e+02, /* 0x406936e678000000 */ + 5.48317031860351562500e+02, /* 0x4081228948000000 */ + 1.49047915649414062500e+03, /* 0x409749eaa8000000 */ + 4.05154199218750000000e+03, /* 0x40afa71580000000 */ + 1.10132329101562500000e+04, /* 0x40c5829dd0000000 */ + 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */ + 8.13773945312500000000e+04, /* 0x40f3de1650000000 */ + 2.21206695312500000000e+05, /* 0x410b00b590000000 */ + 6.01302140625000000000e+05, /* 0x412259ac48000000 */ + 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */ + 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */ + 1.20774762500000000000e+07, /* 0x4167093488000000 */ + 3.28299845000000000000e+07, /* 0x417f4f2208000000 */ + 8.92411500000000000000e+07, /* 0x419546d8f8000000 */ + 2.42582596000000000000e+08, /* 0x41aceb0888000000 */ + 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */ + 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */ + 4.87240166400000000000e+09, /* 0x41f226af30000000 */ + 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */ + 3.60024494080000000000e+10, /* 0x4220c3d390000000 */ + 9.78648043520000000000e+10, /* 0x4236c93268000000 */ + 2.66024116224000000000e+11, /* 0x424ef822f0000000 */ + 7.23128516608000000000e+11, /* 0x42650bba30000000 */ + 1.96566712320000000000e+12, /* 0x427c9aae40000000 */ + 5.34323724288000000000e+12, /* 0x4293704708000000 */ + 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */ + 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */ + 1.07321789251584000000e+14, /* 0x42d866f348000000 */ + 2.91730863685632000000e+14, /* 0x42f0953e28000000 */ + 7.93006722514944000000e+14, /* 0x430689e220000000 */ + 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */ + + static const double cosh_tail[ 37] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 6.89700037027478056904e-09, /* 0x3e3d9f5504c2bd28 */ + 4.43207835591715833630e-08, /* 0x3e67cb66f0a4c9fd */ + 2.33540217013828929694e-07, /* 0x3e8f58617928e588 */ + 5.17452463948269748331e-08, /* 0x3e6bc7d000c38d48 */ + 9.38728274131605919153e-07, /* 0x3eaf7f9d4e329998 */ + 2.73012191010840495544e-06, /* 0x3ec6e6e464885269 */ + 3.29486051438996307950e-06, /* 0x3ecba3a8b946c154 */ + 4.75803746362771416375e-06, /* 0x3ed3f4e76110d5a4 */ + 3.33050940471947692369e-05, /* 0x3f017622515a3e2b */ + 9.94707313972136215365e-06, /* 0x3ee4dc4b528af3d0 */ + 6.51685096227860253398e-05, /* 0x3f11156278615e10 */ + 1.18132406658066663359e-03, /* 0x3f535ad50ed821f5 */ + 6.93090416366541877541e-04, /* 0x3f46b61055f2935c */ + 1.45780415323416845386e-03, /* 0x3f57e2794a601240 */ + 
2.99862082708111758744e-02, /* 0x3f9eb4b45f6aadd3 */
+    1.02539925859688602072e-02, /* 0x3f85000b967b3698 */
+    1.26787669807076286421e-01, /* 0x3fc03a940fadc092 */
+    6.86652631843830962843e-02, /* 0x3fb1940bf3bf874c */
+    4.81593633223853068159e-01, /* 0x3fded26e1a2a2110 */
+    1.70489514001513020602e+00, /* 0x3ffb4740205796d6 */
+    1.12416073489841270572e+01, /* 0x40267bb3f55cb85d */
+    7.06579578098005001152e+00, /* 0x401c435ff81e18ac */
+    5.91244513000686140458e+01, /* 0x404d8fee052bdea4 */
+    1.68921736147088438429e+02, /* 0x40651d7edccde926 */
+    2.60692936262087528121e+02, /* 0x40704b1644557e0e */
+    3.62419382134890611269e+02, /* 0x4076a6b5ca0a9e1c */
+    4.07689930834187453002e+03, /* 0x40afd9cc72249abe */
+    1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+    2.53720210371943103382e+04, /* 0x40d8c70158ac6364 */
+    4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+    1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+    5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+    6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+    7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+    3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+    7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+  unsigned long long ux, aux, xneg;
+  double y, z, z1, z2;
+  int m;
+
+  /* Special cases */
+
+  GET_BITS_DP64(x, ux);
+  aux = ux & ~SIGNBIT_DP64;
+  if (aux < 0x3e30000000000000) /* |x| small enough that cosh(x) = 1 */
+    {
+      if (aux == 0)
+        /* with no inexact */
+        return 1.0;
+      else
+        return val_with_flags(1.0, AMD_F_INEXACT);
+    }
+  else if (aux >= PINFBITPATT_DP64) /* |x| is NaN or Inf */
+    {
+      if (aux > PINFBITPATT_DP64) /* x is a NaN */
+        return x + x;
+      else /* x is infinity */
+        return infinity_with_flags(0);
+    }
+
+  xneg = (aux != ux);
+
+  y = x;
+  if (xneg) y = -x;
+
+  if (y >= max_cosh_arg)
+    {
+      /* Return +infinity with overflow flag */
+#ifdef WINDOWS
+      return handle_error("cosh", PINFBITPATT_DP64, _OVERFLOW,
+                          AMD_F_OVERFLOW, ERANGE, x, 0.0);
+#else
+      return retval_errno_erange(x);
+#endif
+    }
+  else if (y >= small_threshold)
+    {
+      /* In this range y is large enough so that
+         the negative exponential is negligible,
+         so cosh(y) is approximated by exp(y)/2. The
+         code below is an inlined version of that from
+         exp() with two changes (it operates on
+         y instead of x, and the division by 2 is
+         done by reducing m by 1). */
+
+      splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+               log2_by_32_tail, &m, &z1, &z2);
+      m -= 1;
+
+      if (m >= EMIN_DP64 && m <= EMAX_DP64)
+        z = scaleDouble_1((z1+z2),m);
+      else
+        z = scaleDouble_2((z1+z2),m);
+    }
+  else
+    {
+      /* In this range we find the integer part y0 of y
+         and the increment dy = y - y0. We then compute
+
+         z = cosh(y) = cosh(y0)cosh(dy) + sinh(y0)sinh(dy)
+
+         where sinh(y0) and cosh(y0) are tabulated above.
*/ + + int ind; + double dy, dy2, sdy, cdy; + + ind = (int)y; + dy = y - ind; + + dy2 = dy*dy; + sdy = dy*dy2*(0.166666666666666667013899e0 + + (0.833333333333329931873097e-2 + + (0.198412698413242405162014e-3 + + (0.275573191913636406057211e-5 + + (0.250521176994133472333666e-7 + + (0.160576793121939886190847e-9 + + 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2); + + cdy = dy2*(0.500000000000000005911074e0 + + (0.416666666666660876512776e-1 + + (0.138888888889814854814536e-2 + + (0.248015872460622433115785e-4 + + (0.275573350756016588011357e-6 + + (0.208744349831471353536305e-8 + + 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2); + + /* At this point sinh(dy) is approximated by dy + sdy, and cosh(dy) is approximated by 1 + cdy. + Shift some significant bits from dy to cdy. */ + z = ((((((cosh_tail[ind]*cdy + sinh_tail[ind]*sdy) + + sinh_tail[ind]*dy) + cosh_tail[ind]) + + cosh_lead[ind]*cdy) + sinh_lead[ind]*sdy) + + sinh_lead[ind]*dy) + cosh_lead[ind]; + } + + return z; +} + +weak_alias (__cosh, cosh)
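For the mid range the code above leans entirely on the addition formula cosh(y0 + dy) = cosh(y0)cosh(dy) + sinh(y0)sinh(dy), with the integer part y0 indexing the lead/tail tables and short Maclaurin series supplying sinh(dy) and cosh(dy) for 0 <= dy < 1. A plain-double sketch of the same decomposition, with libm calls standing in for the tables and polynomials:

#include <math.h>
#include <stdio.h>

/* cosh via the addition formula: split y into an integer part y0
   (the table index in the real code) and a remainder dy in [0, 1). */
static double cosh_by_table(double y)   /* assumes 0 <= y < 37 */
{
    int ind = (int)y;
    double dy = y - ind;
    return cosh((double)ind) * cosh(dy) + sinh((double)ind) * sinh(dy);
}

int main(void)
{
    double y = 3.75;
    printf("%.17g\n", cosh_by_table(y));
    printf("%.17g\n", cosh(y));   /* reference */
    return 0;
}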
diff --git a/src/coshf.c b/src/coshf.c new file mode 100644 index 0000000..ab2b68e --- /dev/null +++ b/src/coshf.c
@@ -0,0 +1,268 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange(float x)
+{
+  struct exception exc;
+  exc.arg1 = (double)x;
+  exc.arg2 = (double)x;
+  exc.type = OVERFLOW;
+  exc.name = (char *)"coshf";
+  if (_LIB_VERSION == _SVID_)
+    {
+      exc.retval = HUGE;
+    }
+  else
+    {
+      exc.retval = infinityf_with_flags(AMD_F_OVERFLOW);
+    }
+  if (_LIB_VERSION == _POSIX_)
+    __set_errno(ERANGE);
+  else if (!matherr(&exc))
+    __set_errno(ERANGE);
+  return exc.retval;
+}
+
+#endif
+float FN_PROTOTYPE(coshf)(float fx)
+{
+  /*
+    Note that cosh is an even function, so the sign of the argument
+    plays no part in the result.
+
+    After dealing with special cases the computation is split into
+    regions as follows:
+
+    abs(x) >= max_cosh_arg:
+    cosh(x) = +Inf with overflow
+
+    abs(x) >= small_threshold:
+    cosh(x) = exp(abs(x))/2 computed using the
+    splitexp and scaleDouble functions as for exp_amd().
+
+    abs(x) < small_threshold:
+    split abs(x) into an integer part y0 and a remainder dy,
+    then combine the tabulated sinh(y0) and cosh(y0) with short
+    series for sinh(dy) and cosh(dy) using the addition formula
+    cosh(y0+dy) = cosh(y0)*cosh(dy) + sinh(y0)*sinh(dy). */
+
+  static const double
+    /* The max argument of coshf, but stored as a double */
+    max_cosh_arg = 8.94159862922329438106e+01,      /* 0x40565a9f84f82e63 */
+    thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+    log2_by_32_lead = 2.16608493356034159660e-02,   /* 0x3f962e42fe000000 */
+    log2_by_32_tail = 5.68948749532545630390e-11,   /* 0x3dcf473de6af278e */
+
+    small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+//  small_threshold = 20.0;
+    /* (8*BASEDIGITS_DP64*log10of2) => exp(-x) insignificant c.f. exp(x) */
+
+  /* Tabulated values of sinh(i) and cosh(i) for i = 0,...,36.
*/ + + static const double sinh_lead[ 37] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.17520119364380137839e+00, /* 0x3ff2cd9fc44eb982 */ + 3.62686040784701857476e+00, /* 0x400d03cf63b6e19f */ + 1.00178749274099008204e+01, /* 0x40240926e70949ad */ + 2.72899171971277496596e+01, /* 0x403b4a3803703630 */ + 7.42032105777887522891e+01, /* 0x40528d0166f07374 */ + 2.01713157370279219549e+02, /* 0x406936d22f67c805 */ + 5.48316123273246489589e+02, /* 0x408122876ba380c9 */ + 1.49047882578955000099e+03, /* 0x409749ea514eca65 */ + 4.05154190208278987484e+03, /* 0x40afa7157430966f */ + 1.10132328747033916443e+04, /* 0x40c5829dced69991 */ + 2.99370708492480553105e+04, /* 0x40dd3c4488cb48d6 */ + 8.13773957064298447222e+04, /* 0x40f3de1654d043f0 */ + 2.21206696003330085659e+05, /* 0x410b00b5916a31a5 */ + 6.01302142081972560845e+05, /* 0x412259ac48bef7e3 */ + 1.63450868623590236530e+06, /* 0x4138f0ccafad27f6 */ + 4.44305526025387924165e+06, /* 0x4150f2ebd0a7ffe3 */ + 1.20774763767876271158e+07, /* 0x416709348c0ea4ed */ + 3.28299845686652474105e+07, /* 0x417f4f22091940bb */ + 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */ + 2.42582597704895108938e+08, /* 0x41aceb088b68e803 */ + 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */ + 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */ + 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */ + 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */ + 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */ + 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */ + 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */ + 7.23128532145737548828e+11, /* 0x42650bba3796379a */ + 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */ + 5.34323729076223046875e+12, /* 0x429370470aec28ec */ + 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */ + 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */ + 1.07321789892958031250e+14, /* 0x42d866f34a725782 */ + 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */ + 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */ + 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */ + + static const double cosh_lead[ 37] = { + 1.00000000000000000000e+00, /* 0x3ff0000000000000 */ + 1.54308063481524371241e+00, /* 0x3ff8b07551d9f550 */ + 3.76219569108363138810e+00, /* 0x400e18fa0df2d9bc */ + 1.00676619957777653269e+01, /* 0x402422a497d6185e */ + 2.73082328360164865444e+01, /* 0x403b4ee858de3e80 */ + 7.42099485247878334349e+01, /* 0x40528d6fcbeff3a9 */ + 2.01715636122455890700e+02, /* 0x406936e67db9b919 */ + 5.48317035155212010977e+02, /* 0x4081228949ba3a8b */ + 1.49047916125217807348e+03, /* 0x409749eaa93f4e76 */ + 4.05154202549259389343e+03, /* 0x40afa715845d8894 */ + 1.10132329201033226127e+04, /* 0x40c5829dd053712d */ + 2.99370708659497577173e+04, /* 0x40dd3c4489115627 */ + 8.13773957125740562333e+04, /* 0x40f3de1654d6b543 */ + 2.21206696005590405548e+05, /* 0x410b00b5916b6105 */ + 6.01302142082804115489e+05, /* 0x412259ac48bf13ca */ + 1.63450868623620807193e+06, /* 0x4138f0ccafad2d17 */ + 4.44305526025399193168e+06, /* 0x4150f2ebd0a8005c */ + 1.20774763767876680940e+07, /* 0x416709348c0ea503 */ + 3.28299845686652623117e+07, /* 0x417f4f22091940bf */ + 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */ + 2.42582597704895138741e+08, /* 0x41aceb088b68e804 */ + 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */ + 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */ + 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */ + 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */ + 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */ 
+      9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+      2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+      7.23128532145737548828e+11, /* 0x42650bba3796379a */
+      1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+      5.34323729076223046875e+12, /* 0x429370470aec28ec */
+      1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+      3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+      1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+      2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+      7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+      2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+  unsigned long long ux, aux, xneg;
+  double x = fx, y, z, z1, z2;
+  int m;
+
+  /* Special cases */
+
+  GET_BITS_DP64(x, ux);
+  aux = ux & ~SIGNBIT_DP64;
+  if (aux < 0x3f10000000000000) /* |x| small enough that cosh(x) = 1 */
+    {
+      if (aux == 0) return (float)1.0; /* with no inexact */
+      if (LAMBDA_DP64 + x > 1.0) return valf_with_flags((float)1.0, AMD_F_INEXACT); /* with inexact */
+    }
+  else if (aux >= PINFBITPATT_DP64) /* |x| is NaN or Inf */
+    {
+      if (aux > PINFBITPATT_DP64) /* |x| is a NaN? */
+        return fx + fx;
+      else /* x is infinity */
+        return infinityf_with_flags(0);
+    }
+
+  xneg = (aux != ux);
+
+  y = x;
+  if (xneg) y = -x;
+
+  if (y >= max_cosh_arg)
+    {
+      /* Return infinity with overflow flag. */
+      /* This handles POSIX behaviour */
+      __set_errno(ERANGE);
+      z = infinityf_with_flags(AMD_F_OVERFLOW);
+    }
+  else if (y >= small_threshold)
+    {
+      /* In this range y is large enough so that
+         the negative exponential is negligible,
+         so cosh(y) is approximated by exp(y)/2. The
+         code below is an inlined version of that from
+         exp() with two changes (it operates on
+         y instead of x, and the division by 2 is
+         done by reducing m by 1). */
+
+      splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+               log2_by_32_tail, &m, &z1, &z2);
+      m -= 1;
+
+      /* scaleDouble_1 is always safe because the argument x was
+         float, rather than double */
+
+      z = scaleDouble_1((z1+z2),m);
+    }
+  else
+    {
+      /* In this range we find the integer part y0 of y
+         and the increment dy = y - y0. We then compute
+
+         z = cosh(y) = cosh(y0)cosh(dy) + sinh(y0)sinh(dy)
+
+         where sinh(y0) and cosh(y0) are tabulated above. */
+
+      int ind;
+      double dy, dy2, sdy, cdy;
+
+      ind = (int)y;
+      dy = y - ind;
+
+      dy2 = dy*dy;
+
+      sdy = dy + dy*dy2*(0.166666666666666667013899e0 +
+                         (0.833333333333329931873097e-2 +
+                          (0.198412698413242405162014e-3 +
+                           (0.275573191913636406057211e-5 +
+                            (0.250521176994133472333666e-7 +
+                             (0.160576793121939886190847e-9 +
+                              0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+      cdy = 1 + dy2*(0.500000000000000005911074e0 +
+                     (0.416666666666660876512776e-1 +
+                      (0.138888888889814854814536e-2 +
+                       (0.248015872460622433115785e-4 +
+                        (0.275573350756016588011357e-6 +
+                         (0.208744349831471353536305e-8 +
+                          0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+      z = cosh_lead[ind]*cdy + sinh_lead[ind]*sdy;
+    }
+
+  /* No sign fixup: cosh is an even function. */
+  return (float)z;
+}
+
+weak_alias (__coshf, coshf)
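Neither cutoff above is arbitrary. coshf overflows exactly when exp(|x|)/2 exceeds FLT_MAX, so max_cosh_arg is log(2*FLT_MAX) stored as a double, and small_threshold marks the point where exp(-|x|) sits roughly 8*BASEDIGITS_DP64 decimal digits below exp(|x|). A quick check of both constants (BASEDIGITS_DP64 is assumed to be 15 here; the actual header value may differ):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* overflow cutoff: exp(x)/2 > FLT_MAX  <=>  x > log(2*FLT_MAX) */
    printf("log(2*FLT_MAX)  = %.17g\n", log(2.0 * (double)FLT_MAX));
    printf("max_cosh_arg    = %.17g\n", 8.94159862922329438106e+01);
    /* assumed BASEDIGITS_DP64 = 15 */
    printf("small_threshold = %.17g\n", 8 * 15 * 0.30102999566398119521373889);
    return 0;
}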
diff --git a/src/exp_special.c b/src/exp_special.c new file mode 100644 index 0000000..ca32ec2 --- /dev/null +++ b/src/exp_special.c
@@ -0,0 +1,110 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + +#ifdef __x86_64__ + +#include <emmintrin.h> +#include <math.h> +#include <errno.h> + +#include "../inc/libm_util_amd.h" +#include "../inc/libm_special.h" + +// y = expf(x) +// y = exp(x) + +// these codes and the ones in the related .S or .asm files have to match +#define EXP_X_NAN 1 +#define EXP_Y_ZERO 2 +#define EXP_Y_INF 3 + +float _expf_special(float x, float y, U32 code) +{ + switch(code) + { + case EXP_X_NAN: + { +#ifdef WIN64 + // y is assumed to be qnan, only check x for snan + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(DOMAIN, EDOM, "expf", x, is_x_snan, 0.0f, 0, y, 0); +#else + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#endif + } + break; + + case EXP_Y_ZERO: + { + _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW)); + __amd_handle_errorf(UNDERFLOW, ERANGE, "expf", x, 0, 0.0f, 0, y, 0); + } + break; + + case EXP_Y_INF: + { + _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW)); + __amd_handle_errorf(OVERFLOW, ERANGE, "expf", x, 0, 0.0f, 0, y, 0); + } + break; + } + + + return y; +} + +double _exp_special(double x, double y, U32 code) +{ + switch(code) + { + case EXP_X_NAN: + { +#ifdef WIN64 + __amd_handle_error(DOMAIN, EDOM, "exp", x, 0.0, y); +#else + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#endif + } + break; + + case EXP_Y_ZERO: + { + _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW)); + __amd_handle_error(UNDERFLOW, ERANGE, "exp", x, 0.0, y); + } + break; + + case EXP_Y_INF: + { + _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW)); + __amd_handle_error(OVERFLOW, ERANGE, "exp", x, 0.0, y); + } + break; + } + + + return y; +} + +#endif /* __x86_64__ */
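Note the pattern in both handlers: instead of executing a faulting operation, they raise the IEEE sticky flags directly by OR-ing status bits into MXCSR. A standalone sketch of that idiom, using the architectural MXCSR flag positions (invalid = bit 0, divide-by-zero = 2, overflow = 3, underflow = 4, precision = 5); the library's MXCSR_ES_* macros are presumably the same masks, but that is an assumption here:

#include <emmintrin.h>
#include <stdio.h>

#define MY_MXCSR_OVERFLOW (1u << 3) /* assumed equivalent of MXCSR_ES_OVERFLOW */
#define MY_MXCSR_INEXACT  (1u << 5) /* assumed equivalent of MXCSR_ES_INEXACT  */

int main(void)
{
    /* raise the sticky overflow and inexact flags without any arithmetic */
    _mm_setcsr(_mm_getcsr() | (MY_MXCSR_INEXACT | MY_MXCSR_OVERFLOW));

    unsigned csr = _mm_getcsr();
    printf("overflow flag = %u, inexact flag = %u\n",
           (csr >> 3) & 1u, (csr >> 5) & 1u);
    return 0;
}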
diff --git a/src/finite.c b/src/finite.c new file mode 100644 index 0000000..7e7ca39 --- /dev/null +++ b/src/finite.c
@@ -0,0 +1,60 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+/* Returns 0 if x is infinite or NaN, otherwise returns 1 */
+
+int FN_PROTOTYPE(finite)(double x)
+{
+
+#ifdef WINDOWS
+
+  unsigned long long ux;
+  GET_BITS_DP64(x, ux);
+  return (int)(((ux & ~SIGNBIT_DP64) - PINFBITPATT_DP64) >> 63);
+
+#else
+
+  /* This works on Hammer with gcc */
+  unsigned long long ux = 0x7ff0000000000000;
+  double temp;
+  PUT_BITS_DP64(ux, temp);
+
+  // double temp = 1.0e444; /* = infinity = 0x7ff0000000000000 */
+  volatile int retval;
+  retval = 0;
+  asm volatile ("andpd %0, %1;" : : "x" (temp), "x" (x));
+  asm volatile ("comisd %0, %1" : : "x" (temp), "x" (x));
+  asm volatile ("setnz %0" : "=g" (retval));
+  return retval;
+
+#endif
+}
+
+weak_alias (__finite, finite)
diff --git a/src/finitef.c b/src/finitef.c new file mode 100644 index 0000000..8c0613a --- /dev/null +++ b/src/finitef.c
@@ -0,0 +1,60 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+/* Returns 0 if x is infinite or NaN, otherwise returns 1 */
+
+int FN_PROTOTYPE(finitef)(float x)
+{
+
+#ifdef WINDOWS
+
+  unsigned int ux;
+  GET_BITS_SP32(x, ux);
+  return (int)(((ux & ~SIGNBIT_SP32) - PINFBITPATT_SP32) >> 31);
+
+#else
+
+  /* This works on Hammer */
+  unsigned int ux = 0x7f800000;
+  float temp;
+  PUT_BITS_SP32(ux, temp);
+
+  /* float temp = 1.0e444; */ /* = infinity = 0x7f800000 */
+  volatile int retval;
+  retval = 0;
+  asm volatile ("andps %0, %1;" : : "x" (temp), "x" (x));
+  asm volatile ("comiss %0, %1" : : "x" (temp), "x" (x));
+  asm volatile ("setnz %0" : "=g" (retval));
+  return retval;
+
+#endif
+}
+
+weak_alias (__finitef, finitef)
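finite and finitef share one branch-free trick: clear the sign bit, then subtract the all-ones-exponent (+Inf) bit pattern as an unsigned integer. For every finite input the subtraction borrows, the result's top bit is set, and the shift yields 1; for Inf or NaN the difference is small and non-negative, yielding 0. A portable sketch of the 64-bit case, with memcpy standing in for the GET_BITS_DP64 macro:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int finite64(double x)
{
    uint64_t ux;
    memcpy(&ux, &x, sizeof ux); /* reinterpret the double as raw bits */
    return (int)(((ux & ~0x8000000000000000ULL) - 0x7ff0000000000000ULL) >> 63);
}

int main(void)
{
    double tests[] = { 0.0, -1.5, 1e308, INFINITY, -INFINITY, NAN };
    for (int i = 0; i < 6; i++)
        printf("finite64(%g) = %d\n", tests[i], finite64(tests[i]));
    return 0;
}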
diff --git a/src/floor.c b/src/floor.c new file mode 100644 index 0000000..a1b99c5 --- /dev/null +++ b/src/floor.c
@@ -0,0 +1,92 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#ifdef WINDOWS +#include "../inc/libm_errno_amd.h" +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_HANDLE_ERROR +#endif + +#ifdef WINDOWS +#pragma function(floor) +#endif + +double FN_PROTOTYPE(floor)(double x) +{ + double r; + long long rexp, xneg; + + + unsigned long long ux, ax, ur, mask; + + GET_BITS_DP64(x, ux); + ax = ux & (~SIGNBIT_DP64); + xneg = (ux != ax); + + if (ax >= 0x4340000000000000) + { + /* abs(x) is either NaN, infinity, or >= 2^53 */ + if (ax > 0x7ff0000000000000) + /* x is NaN */ +#ifdef WINDOWS + return handle_error("floor", ux|0x0008000000000000, _DOMAIN, + 0, EDOM, x, 0.0); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + else + return x; + } + else if (ax < 0x3ff0000000000000) /* abs(x) < 1.0 */ + { + if (ax == 0x0000000000000000) + /* x is +zero or -zero; return the same zero */ + return x; + else if (xneg) /* x < 0.0 */ + return -1.0; + else + return 0.0; + } + else + { + r = x; + rexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64; + /* Mask out the bits of r that we don't want */ + mask = 1; + mask = (mask << (EXPSHIFTBITS_DP64 - rexp)) - 1; + ur = (ux & ~mask); + PUT_BITS_DP64(ur, r); + if (xneg && (ur != ux)) + /* We threw some bits away and x was negative */ + return r - 1.0; + else + return r; + } + +} + +weak_alias (__floor, floor)
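The interesting branch above is 1.0 <= |x| < 2^53: the unbiased exponent rexp counts how many fraction bits lie above the binary point, so the low (52 - rexp) mantissa bits are exactly the fractional part. Clearing them truncates toward zero, and a negative input that actually lost bits then needs one more step down. A standalone C sketch of that masking step (an illustration, not the library routine itself):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static double floor_by_mask(double x) /* assumes 1.0 <= |x| < 2^53 */
{
    uint64_t ux, ur;
    memcpy(&ux, &x, sizeof ux);
    int rexp = (int)((ux >> 52) & 0x7ff) - 1023; /* unbiased exponent      */
    uint64_t mask = (1ULL << (52 - rexp)) - 1;   /* the fractional bits    */
    ur = ux & ~mask;                             /* truncate toward zero   */
    double r;
    memcpy(&r, &ur, sizeof r);
    if ((ux >> 63) && ur != ux) /* negative, and bits were discarded */
        r -= 1.0;
    return r;
}

int main(void)
{
    printf("%g %g %g\n", floor_by_mask(2.7), floor_by_mask(-2.7), floor_by_mask(-4.0));
    return 0;
}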
diff --git a/src/floorf.c b/src/floorf.c new file mode 100644 index 0000000..e0f855b --- /dev/null +++ b/src/floorf.c
@@ -0,0 +1,87 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#ifdef WINDOWS +#include "../inc/libm_errno_amd.h" +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_HANDLE_ERRORF +#endif + +#ifdef WINDOWS +#pragma function(floorf) +#endif + +float FN_PROTOTYPE(floorf)(float x) +{ + float r; + int rexp, xneg; + unsigned int ux, ax, ur, mask; + + GET_BITS_SP32(x, ux); + ax = ux & (~SIGNBIT_SP32); + xneg = (ux != ax); + + if (ax >= 0x4b800000) + { + /* abs(x) is either NaN, infinity, or >= 2^24 */ + if (ax > 0x7f800000) + /* x is NaN */ +#ifdef WINDOWS + return handle_errorf("floorf", ux|0x00400000, _DOMAIN, + 0, EDOM, x, 0.0F); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + else + return x; + } + else if (ax < 0x3f800000) /* abs(x) < 1.0 */ + { + if (ax == 0x00000000) + /* x is +zero or -zero; return the same zero */ + return x; + else if (xneg) /* x < 0.0 */ + return -1.0F; + else + return 0.0F; + } + else + { + rexp = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32; + /* Mask out the bits of r that we don't want */ + mask = (1 << (EXPSHIFTBITS_SP32 - rexp)) - 1; + ur = (ux & ~mask); + PUT_BITS_SP32(ur, r); + if (xneg && (ux != ur)) + /* We threw some bits away and x was negative */ + return r - 1.0F; + else + return r; + } +} + +weak_alias (__floorf, floorf)
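A quick harness for the edge cases floorf singles out: any float with |x| >= 2^24 is already an integer and is returned as-is, inputs below 1.0 in magnitude collapse to 0.0F or -1.0F, and a signed zero keeps its sign. The standard floorf serves as the reference here:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float big = 16777216.0f; /* 2^24: no fraction bits remain at this scale */
    printf("floorf(%g)   = %g\n", big, floorf(big));
    printf("floorf(0.5)  = %g\n", floorf(0.5f));  /*  0 */
    printf("floorf(-0.5) = %g\n", floorf(-0.5f)); /* -1 */
    printf("floorf(-0.0) = %g\n", floorf(-0.0f)); /* -0: the zero's sign survives */
    return 0;
}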
diff --git a/src/frexp.c b/src/frexp.c new file mode 100644 index 0000000..0ae109c --- /dev/null +++ b/src/frexp.c
@@ -0,0 +1,54 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+double FN_PROTOTYPE(frexp)(double value, int *exp)
+{
+  UT64 val;
+  unsigned int sign;
+  int exponent;
+  val.f64 = value;
+  sign = val.u32[1] & SIGNBIT_SP32;
+  val.u32[1] = val.u32[1] & ~SIGNBIT_SP32; /* remove the sign bit */
+  *exp = 0;
+  if((val.f64 == 0.0) || ((val.u32[1] & 0x7ff00000) == 0x7ff00000))
+    return value; /* +-0, NaN and +-Inf are returned unchanged */
+
+  exponent = val.u32[1] >> 20; /* get the biased exponent */
+
+  if(exponent == 0) /* x is denormal */
+    {
+      val.f64 = val.f64 * VAL_2PMULTIPLIER_DP; /* multiply by 2^53 to bring it into the normal range */
+      exponent = val.u32[1] >> 20; /* get the biased exponent */
+      exponent = exponent - MULTIPLIER_DP; /* undo the 2^53 scaling */
+    }
+
+  exponent -= 1022; /* subtract (bias - 1) so that value = m * 2^exponent */
+  *exp = exponent; /* set the integral power of two */
+  val.u32[1] = sign | 0x3fe00000 | (val.u32[1] & 0x000fffff); /* force the exponent field to -1, building the fraction m with 0.5 <= |m| < 1 */
+  return val.f64;
+}
+
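The contract implemented above is value = m * 2^(*exp) with 0.5 <= |m| < 1; the denormal branch first scales by 2^53 so the exponent field becomes non-zero, then subtracts the 53 back out. A short demonstration through the standard frexp, which the bit-level code reproduces:

#include <math.h>
#include <stdio.h>

int main(void)
{
    int e;
    double m = frexp(6.0, &e);
    printf("6.0 = %g * 2^%d\n", m, e); /* 0.75 * 2^3 */

    /* a denormal: its exponent field is 0, so it must be scaled first */
    double denorm = 0x1p-1060;
    double scaled = denorm * 0x1p53; /* now normal; the exponent is readable */
    m = frexp(denorm, &e);
    printf("%a = %g * 2^%d (after scaling: %a)\n", denorm, m, e, scaled);
    return 0;
}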
diff --git a/src/frexpf.c b/src/frexpf.c new file mode 100644 index 0000000..e2b4ece --- /dev/null +++ b/src/frexpf.c
@@ -0,0 +1,55 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+
+float FN_PROTOTYPE(frexpf)(float value, int *exp)
+{
+  UT32 val;
+  unsigned int sign;
+  int exponent;
+  val.f32 = value;
+  sign = val.u32 & SIGNBIT_SP32;
+  val.u32 = val.u32 & ~SIGNBIT_SP32; /* remove the sign bit */
+  *exp = 0;
+  if((val.f32 == 0.0) || ((val.u32 & 0x7f800000) == 0x7f800000))
+    return value; /* +-0, NaN and +-Inf are returned unchanged */
+
+  exponent = val.u32 >> 23; /* get the biased exponent */
+
+  if(exponent == 0) /* x is denormal */
+    {
+      val.f32 = val.f32 * VAL_2PMULTIPLIER_SP; /* multiply by 2^24 to bring it into the normal range */
+      exponent = (val.u32 >> 23); /* get the biased exponent */
+      exponent = exponent - MULTIPLIER_SP; /* undo the 2^24 scaling */
+    }
+
+  exponent -= 126; /* subtract (bias - 1) so that value = m * 2^exponent */
+  *exp = exponent; /* set the integral power of two */
+  val.u32 = sign | 0x3f000000 | (val.u32 & 0x007fffff); /* force the exponent field to -1, building the fraction m with 0.5 <= |m| < 1 */
+  return val.f32;
+}
+
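The single-precision version follows the same contract with a 2^24 scale for denormals, and ldexpf inverts it exactly. A round-trip check, including a denormal input:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float xs[] = { 6.0f, -0.15625f, 3.4e38f, 1e-44f /* denormal */ };
    for (int i = 0; i < 4; i++) {
        int e;
        float m = frexpf(xs[i], &e);
        printf("%g = %g * 2^%d, rebuilt = %g\n",
               xs[i], m, e, ldexpf(m, e)); /* floats promote to double here */
    }
    return 0;
}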
diff --git a/src/gas/cbrt.S b/src/gas/cbrt.S new file mode 100644 index 0000000..b733a1a --- /dev/null +++ b/src/gas/cbrt.S
@@ -0,0 +1,1575 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# cbrt.S +# +# An implementation of the cbrt libm function. +# +# Prototype: +# +# double cbrt(double x); +# + +# +# Algorithm: +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(cbrt) +#define fname_special _cbrt_special + + +# local variable storage offsets + +.equ store_input, -0x10 +.equ stack_size, 0x20 + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 32 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + xor %rdx,%rdx + #for the time being the stack pointer is not changed at all + #Assuming that this is a leaf procedure we can avoid the decrementing and incrementing + #of the stack pointer. This will save some assembly operations and give us good performance + #results. If there is a procedure call then we need to look at the changes in the stack pointer. + #sub $stack_size, %rsp + movd %xmm0,%rax + movsd %xmm0,%xmm6 + mov .L__exp_mask_64(%rip),%r10 + mov .L__mantissa_mask_64(%rip),%r11 + mov %rax,%r9 + and %r10,%rax # rax = stores the exponent + and %r11,%r9 # r9 = stores the mantissa + shr $52,%rax + cmp $0X7FF,%rax + jz .L__cbrt_is_Nan_Infinite + cmp $0X0,%rax + jz .L__cbrt_is_denormal + +.align 32 +.L__cbrt_is_normal: + mov $3,%rcx # cx is set to 3 to perform division and get the scale and remainder + pand .L__sign_bit_64(%rip),%xmm6 # xmm6 contains the sign + sub $0x3FF,%ax + #we don't need the compare as sub instruction will raise the flags. 
But there was no performance improvement
+    cmp $0,%ax
+    jge .L__donot_change_dx
+    not %dx
+.L__donot_change_dx:
+    idiv %cx   #the accumulator is divided by cx=3
+               #ax contains the quotient
+               #dx contains the remainder
+    mov %dx,%cx
+    add $0x3FF,%ax
+    shl $52,%rax
+    add $2,%cx
+    shl $1,%cx
+    #rax now holds the scale factor built from the quotient
+    mov %rax,store_input(%rsp)
+    movsd store_input(%rsp),%xmm7 #xmm7 is the scaling factor = mf
+    #xmm0 is the modified input value from the denormal cases
+    pand .L__mantissa_mask_64(%rip),%xmm0
+    por .L__zero_point_five(%rip),%xmm0 #xmm0 = Y
+    mov %r9,%r10
+    shr $43,%r10
+    shr $44,%r9
+    and $0x01,%r10
+    or $0x0100,%r9
+    add %r9,%r10 #r10 = index_u64
+    cvtsi2sd %r10,%xmm4 #xmm4 = index_f64
+    sub $256,%r10
+    lea .L__INV_TAB_256(%rip),%rax
+    mulsd .L__one_by_512(%rip), %xmm4 #xmm4 = F
+    subsd %xmm4,%xmm0 # xmm0 = f
+    movsd (%rax,%r10,8),%xmm4
+    mulsd %xmm4,%xmm0 # xmm0 = r
+
+    #Now perform the polynomial computation
+
+    # movddup %xmm0,%xmm0 # xmm0 = r ,r
+    shufpd $0,%xmm0,%xmm0 # replacing movddup
+
+    mulsd %xmm0,%xmm0 # xmm0 = r ,r^2
+
+    movapd %xmm0,%xmm4 # xmm4 = r ,r^2
+    movapd %xmm0,%xmm3 # xmm3 = r ,r^2
+    mulpd %xmm0,%xmm0 # xmm0 = r^2,r^4 #########
+    mulpd %xmm0,%xmm3 # xmm3 = r^3,r^6 #########
+    movapd %xmm3,%xmm2
+    mulpd .L__coefficients_3_6(%rip),%xmm2 # xmm2 = [coeff3 * r^3, coeff6 * r^6]
+    mulpd %xmm0,%xmm3 # xmm3 = r^5,r^10 We don't need r^10
+    unpckhpd %xmm3,%xmm4 #xmm4 = r^5,r
+    mulpd .L__coefficients_2_4(%rip),%xmm0 # xmm0 = [coeff2 * r^2, coeff4 * r^4]
+    mulpd .L__coefficients_5_1(%rip),%xmm4 # xmm4 = [coeff5 * r^5, coeff1 * r ]
+    movapd %xmm4,%xmm3
+    unpckhpd %xmm3,%xmm3 #xmm3 = [~Don't Care ,coeff5 * r^5]
+    addsd %xmm3,%xmm2 # xmm2 = [coeff3 * r^3, coeff5 * r^5 + coeff6 * r^6]
+    addpd %xmm2,%xmm0 # xmm0 = [coeff2 * r^2 + coeff3 * r^3,coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+    movapd %xmm0,%xmm2
+    unpckhpd %xmm2,%xmm2 #xmm2 = [~Don't Care ,coeff2 * r^2 + coeff3 * r^3]
+    addsd %xmm2,%xmm0 # xmm0 = [~Don't Care, coeff2 * r^2 + coeff3 * r^3 + coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+    addsd %xmm4,%xmm0 # xmm0 = [~Don't Care, coeff1 * r + coeff2 * r^2 + coeff3 * r^3 + coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+
+    # movddup %xmm0,%xmm0
+    shufpd $0,%xmm0,%xmm0 # replacing movddup
+
+
+    #Polynomial computation completes here
+    #Now compute the following
+    #switch(rem)
+    #{
+    # case -2: cbrtRem_h.u64 = 0x3fe428a2f0000000; cbrtRem_t.u64 = 0x3e531ae515c447bb; break;
+    # case -1: cbrtRem_h.u64 = 0x3fe965fea0000000; cbrtRem_t.u64 = 0x3e44f5b8f20ac166; break;
+    # case 0: cbrtRem_h.u64 = 0x3ff0000000000000; cbrtRem_t.u64 = 0x0000000000000000; break;
+    # case 1: cbrtRem_h.u64 = 0x3ff428a2f0000000; cbrtRem_t.u64 = 0x3e631ae515c447bb; break;
+    # case 2: cbrtRem_h.u64 = 0x3ff965fea0000000; cbrtRem_t.u64 = 0x3e54f5b8f20ac166; break;
+    # default: break;
+    #}
+    #cbrtF_h.u64 = CBRT_F_H[index_u64-256];
+    #cbrtF_t.u64 = CBRT_F_T[index_u64-256];
+    #
+    #bH = (cbrtF_h.f64 * cbrtRem_h.f64);
+    #bT = ((((cbrtF_t.f64 * cbrtRem_t.f64)) + (cbrtF_t.f64 * cbrtRem_h.f64)) + (cbrtRem_t.f64 * cbrtF_h.f64));
+    lea .L__cuberoot_remainder_h_l(%rip),%r8 # load both head and tail of the remainder's cube root at once
+    movapd (%r8,%rcx,8),%xmm1 # xmm1 = [cbrtRem_h.f64,cbrtRem_t.f64]
+    shl $1,%r10
+    lea .L__CBRT_F_H_L_256(%rip),%rax
+    movapd (%rax,%r10,8),%xmm2 # xmm2 = [cbrtF_h.f64,cbrtF_t.f64]
+    movapd %xmm2,%xmm3
+    psrldq $8,%xmm3 # xmm3 = [~Dont Care,cbrtF_h.f64]
+    unpcklpd %xmm2,%xmm3 # xmm3 = [cbrtF_t.f64,cbrtF_h.f64]
+
+    mulpd %xmm1,%xmm2 # xmm2 = [(cbrtF_h.f64*cbrtRem_h.f64),(cbrtRem_t.f64*cbrtF_t.f64)]
+    mulpd %xmm1,%xmm3 # xmm3 = [(cbrtRem_h.f64*cbrtF_t.f64),(cbrtRem_t.f64*cbrtF_h.f64)]
+    movapd %xmm3,%xmm4
+    unpckhpd %xmm4,%xmm4 # xmm4 = [(cbrtRem_h.f64*cbrtF_t.f64),(cbrtRem_h.f64*cbrtF_t.f64)]
+    addsd %xmm4,%xmm3 # xmm3 = [~Dont Care, ((cbrtRem_h.f64*cbrtF_t.f64) + (cbrtRem_t.f64*cbrtF_h.f64))]
+    addsd %xmm3,%xmm2 # xmm2 = [(cbrtF_h.f64*cbrtRem_h.f64),(((cbrtRem_t.f64*cbrtF_t.f64)+(cbrtRem_h.f64*cbrtF_t.f64) + (cbrtRem_t.f64*cbrtF_h.f64))]
+    # xmm2 = [bH,bT]
+    # Now calculate
+    #ans.f64 = (((((z * bT)) + (bT)) + (z * bH)) + (bH));
+    #ans.f64 = ans.f64 * mf;
+    #ans.u64 = ans.u64 | sign.u64;
+
+    movapd %xmm2,%xmm3
+    unpckhpd %xmm3,%xmm3 # xmm3 = [Dont Care,bH]
+    # also xmm0 = [z,z] = the polynomial which was computed earlier
+    mulpd %xmm2,%xmm0 # xmm0 = [(bH*z),(bT*z)]
+    movapd %xmm0,%xmm4
+    unpckhpd %xmm4,%xmm4 # xmm4 = [(bH*z),(bH*z)]
+    addsd %xmm2,%xmm0 # xmm0 = [~DontCare, ((bT*z) + bT)]
+    unpckhpd %xmm2,%xmm2 # xmm2 = [(bH),(bH)]
+    addsd %xmm4,%xmm0 # xmm0 = [~DontCare, (((bT*z) + bT) + ( z*bH))]
+    addsd %xmm2,%xmm0 # xmm0 = [~DontCare, ((((bT*z) + bT) + (z*bH)) + bH)] = [~Dont Care,ans.f64]
+    mulsd %xmm7,%xmm0 # xmm0 = ans.f64 * mf; mf is the scaling factor
+    por %xmm6,%xmm0 # restore the sign
+    #add $stack_size, %rsp
+    ret
+
+
+.align 32
+.L__cbrt_is_denormal:
+    movsd .L__one_mask_64(%rip),%xmm4
+    cmp $0,%r9
+    jz .L__cbrt_is_zero
+    pand .L__sign_mask_64(%rip),%xmm0
+    por %xmm4,%xmm0
+    subsd %xmm4,%xmm0
+    movd %xmm0,%rax
+    mov %rax,%r9
+    and %r10,%rax # rax = stores the exponent
+    and %r11,%r9 # r9 = stores the mantissa
+    shr $52,%rax
+    sub $1022,%rax
+    jmp .L__cbrt_is_normal
+
+.align 32
+.L__cbrt_is_zero:
+    ret
+.align 32
+.L__cbrt_is_Nan_Infinite:
+    cmp $0,%r9
+    jz .L__cbrt_is_Infinite
+    mulsd %xmm0,%xmm0 #this multiplication will raise an invalid exception
+    por .L__qnan_mask_64(%rip),%xmm0
+.L__cbrt_is_Infinite:
+    #add $stack_size, %rsp
+    ret
+
+.align 32
+.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF
+                      .quad 0 #this zero is necessary
+.L__qnan_mask_64:     .quad 0x0008000000000000
+.L__exp_mask_64:      .quad 0x7FF0000000000000
+                      .quad 0
+.L__zero:             .quad 0x0000000000000000
+                      .quad 0
+.align 32
+.L__zero_point_five:  .quad 0x3FE0000000000000
+                      .quad 0
+.align 16
+.L__sign_mask_64:     .quad 0x7FFFFFFFFFFFFFFF
+                      .quad 0
+.L__sign_bit_64:      .quad 0x8000000000000000
+                      .quad 0
+.L__one_mask_64:      .quad 0x3FF0000000000000
+                      .quad 0
+.L__one_by_512:       .quad 0x3f60000000000000
+                      .quad 0
+
+
+.align 16
+.L__denormal_factor:  .quad 0x3F7428A2F98D728B
+                      .quad 0
+# The coefficients are arranged in a specific order to aid parallel multiplication.
+# The number after each coefficient is the power of r that it multiplies.
+.L__coefficients:
+.align 32
+.L__coefficients_5_1: .quad 0x3fd5555555555555 # 1
+                      .quad 0x3f9ee7113506ac13 # 5
+.L__coefficients_2_4: .quad 0xbfa511e8d2b3183b # 4
+                      .quad 0xbfbc71c71c71c71c # 2
+.L__coefficients_3_6: .quad 0xbf98090d6221a247 # 6
+                      .quad 0x3faf9add3c0ca458 # 3
+                      .quad 0x3f93750ad588f114 # 7
+
+
+
+.align 32
+.L__cuberoot_remainder_h_l:
+                      .quad 0x3e531ae515c447bb # cbrt(2^-2) Low
+                      .quad 0x3FE428A2F0000000 # cbrt(2^-2) High
+                      .quad 0x3e44f5b8f20ac166 # cbrt(2^-1) Low
+                      .quad 0x3FE965FEA0000000 # cbrt(2^-1) High
+                      .quad 0x0000000000000000 # cbrt(2^0) Low
+                      .quad 0x3FF0000000000000 # cbrt(2^0) High
+                      .quad 0x3e631ae515c447bb # cbrt(2^1) Low
+                      .quad 0x3FF428A2F0000000 # cbrt(2^1) High
+                      .quad 0x3e54f5b8f20ac166 # cbrt(2^2) Low
+                      .quad 0x3FF965FEA0000000 # cbrt(2^2) High
+
+
+
+#interleaved 
high and low values +.align 32 +.L__CBRT_F_H_L_256: + .quad 0x0000000000000000 + .quad 0x3ff0000000000000 + .quad 0x3e6e6a24c81e4294 + .quad 0x3ff0055380000000 + .quad 0x3e58548511e3a785 + .quad 0x3ff00aa390000000 + .quad 0x3e64eb9336ec07f6 + .quad 0x3ff00ff010000000 + .quad 0x3e40ea64b8b750e1 + .quad 0x3ff0153920000000 + .quad 0x3e461637cff8a53c + .quad 0x3ff01a7eb0000000 + .quad 0x3e40733bf7bd1943 + .quad 0x3ff01fc0d0000000 + .quad 0x3e5666911345cced + .quad 0x3ff024ff80000000 + .quad 0x3e477b7a3f592f14 + .quad 0x3ff02a3ad0000000 + .quad 0x3e6f18d3dd1a5402 + .quad 0x3ff02f72b0000000 + .quad 0x3e2be2f5a58ee9a4 + .quad 0x3ff034a750000000 + .quad 0x3e68901f8f085fa7 + .quad 0x3ff039d880000000 + .quad 0x3e5c68b8cd5b5d69 + .quad 0x3ff03f0670000000 + .quad 0x3e5a6b0e8624be42 + .quad 0x3ff0443110000000 + .quad 0x3dbc4b22b06f68e7 + .quad 0x3ff0495870000000 + .quad 0x3e60f3f0afcabe9b + .quad 0x3ff04e7c80000000 + .quad 0x3e548495bca4e1b7 + .quad 0x3ff0539d60000000 + .quad 0x3e66107f1abdfdc3 + .quad 0x3ff058bb00000000 + .quad 0x3e6e67261878288a + .quad 0x3ff05dd570000000 + .quad 0x3e5a6bc155286f1e + .quad 0x3ff062ecc0000000 + .quad 0x3e58a759c64a85f2 + .quad 0x3ff06800e0000000 + .quad 0x3e45fce70a4a8d09 + .quad 0x3ff06d11e0000000 + .quad 0x3e32f9cbf373fe1d + .quad 0x3ff0721fc0000000 + .quad 0x3e590564ce4ac359 + .quad 0x3ff0772a80000000 + .quad 0x3e5ac29ce761b02f + .quad 0x3ff07c3230000000 + .quad 0x3e5cb752f497381c + .quad 0x3ff08136d0000000 + .quad 0x3e68bb9e1cfb35e0 + .quad 0x3ff0863860000000 + .quad 0x3e65b4917099de90 + .quad 0x3ff08b36f0000000 + .quad 0x3e5cc77ac9c65ef2 + .quad 0x3ff0903280000000 + .quad 0x3e57a0f3e7be3dba + .quad 0x3ff0952b10000000 + .quad 0x3e66ec851ee0c16f + .quad 0x3ff09a20a0000000 + .quad 0x3e689449bf2946da + .quad 0x3ff09f1340000000 + .quad 0x3e698f25301ba223 + .quad 0x3ff0a402f0000000 + .quad 0x3e347d5ec651f549 + .quad 0x3ff0a8efc0000000 + .quad 0x3e6c33ec9a86007a + .quad 0x3ff0add990000000 + .quad 0x3e5e0b6653e92649 + .quad 0x3ff0b2c090000000 + .quad 0x3e3bd64ac09d755f + .quad 0x3ff0b7a4b0000000 + .quad 0x3e2f537506f78167 + .quad 0x3ff0bc85f0000000 + .quad 0x3e62c382d1b3735e + .quad 0x3ff0c16450000000 + .quad 0x3e6e20ed659f99e1 + .quad 0x3ff0c63fe0000000 + .quad 0x3e586b633a9c182a + .quad 0x3ff0cb18b0000000 + .quad 0x3e445cfd5a65e777 + .quad 0x3ff0cfeeb0000000 + .quad 0x3e60c8770f58bca4 + .quad 0x3ff0d4c1e0000000 + .quad 0x3e6739e44b0933c5 + .quad 0x3ff0d99250000000 + .quad 0x3e027dc3d9ce7bd8 + .quad 0x3ff0de6010000000 + .quad 0x3e63c53c7c5a7b64 + .quad 0x3ff0e32b00000000 + .quad 0x3e69669683830cec + .quad 0x3ff0e7f340000000 + .quad 0x3e68d772c39bdcc4 + .quad 0x3ff0ecb8d0000000 + .quad 0x3e69b0008bcf6d7b + .quad 0x3ff0f17bb0000000 + .quad 0x3e3bbb305825ce4f + .quad 0x3ff0f63bf0000000 + .quad 0x3e6da3f4af13a406 + .quad 0x3ff0faf970000000 + .quad 0x3e5f36b96f74ce86 + .quad 0x3ff0ffb460000000 + .quad 0x3e165c002303f790 + .quad 0x3ff1046cb0000000 + .quad 0x3e682f84095ba7d5 + .quad 0x3ff1092250000000 + .quad 0x3e6d46433541b2c6 + .quad 0x3ff10dd560000000 + .quad 0x3e671c3d56e93a89 + .quad 0x3ff11285e0000000 + .quad 0x3e598dcef4e40012 + .quad 0x3ff11733d0000000 + .quad 0x3e4530ebef17fe03 + .quad 0x3ff11bdf30000000 + .quad 0x3e4e8b8fa3715066 + .quad 0x3ff1208800000000 + .quad 0x3e6ab26eb3b211dc + .quad 0x3ff1252e40000000 + .quad 0x3e454dd4dc906307 + .quad 0x3ff129d210000000 + .quad 0x3e5c9f962387984e + .quad 0x3ff12e7350000000 + .quad 0x3e6c62a959afec09 + .quad 0x3ff1331210000000 + .quad 0x3e6638d9ac6a866a + .quad 0x3ff137ae60000000 + .quad 0x3e338704eca8a22d + .quad 
0x3ff13c4840000000 + .quad 0x3e4e6c9e1db14f8f + .quad 0x3ff140dfa0000000 + .quad 0x3e58744b7f9c9eaa + .quad 0x3ff1457490000000 + .quad 0x3e66c2893486373b + .quad 0x3ff14a0710000000 + .quad 0x3e5b36bce31699b7 + .quad 0x3ff14e9730000000 + .quad 0x3e671e3813d200c7 + .quad 0x3ff15324e0000000 + .quad 0x3e699755ab40aa88 + .quad 0x3ff157b030000000 + .quad 0x3e6b45ca0e4bcfc0 + .quad 0x3ff15c3920000000 + .quad 0x3e32dd090d869c5d + .quad 0x3ff160bfc0000000 + .quad 0x3e64fe0516b917da + .quad 0x3ff16543f0000000 + .quad 0x3e694563226317a2 + .quad 0x3ff169c5d0000000 + .quad 0x3e653d8fafc2c851 + .quad 0x3ff16e4560000000 + .quad 0x3e5dcbd41fbd41a3 + .quad 0x3ff172c2a0000000 + .quad 0x3e5862ff5285f59c + .quad 0x3ff1773d90000000 + .quad 0x3e63072ea97a1e1c + .quad 0x3ff17bb630000000 + .quad 0x3e52839075184805 + .quad 0x3ff1802c90000000 + .quad 0x3e64b0323e9eff42 + .quad 0x3ff184a0a0000000 + .quad 0x3e6b158893c45484 + .quad 0x3ff1891270000000 + .quad 0x3e3149ef0fc35826 + .quad 0x3ff18d8210000000 + .quad 0x3e5f2e77ea96acaa + .quad 0x3ff191ef60000000 + .quad 0x3e5200074c471a95 + .quad 0x3ff1965a80000000 + .quad 0x3e63f8cc517f6f04 + .quad 0x3ff19ac360000000 + .quad 0x3e660ba2e311bb55 + .quad 0x3ff19f2a10000000 + .quad 0x3e64b788730bbec3 + .quad 0x3ff1a38e90000000 + .quad 0x3e657090795ee20c + .quad 0x3ff1a7f0e0000000 + .quad 0x3e6d9ffe983670b1 + .quad 0x3ff1ac5100000000 + .quad 0x3e62a463ff61bfda + .quad 0x3ff1b0af00000000 + .quad 0x3e69d1bc6a5e65cf + .quad 0x3ff1b50ad0000000 + .quad 0x3e68718abaa9e922 + .quad 0x3ff1b96480000000 + .quad 0x3e63c2f52ffa342e + .quad 0x3ff1bdbc10000000 + .quad 0x3e60fae13ff42c80 + .quad 0x3ff1c21180000000 + .quad 0x3e65440f0ef00d57 + .quad 0x3ff1c664d0000000 + .quad 0x3e46fcd22d4e3c1e + .quad 0x3ff1cab610000000 + .quad 0x3e4e0c60b409e863 + .quad 0x3ff1cf0530000000 + .quad 0x3e6f9cab5a5f0333 + .quad 0x3ff1d35230000000 + .quad 0x3e630f24744c333d + .quad 0x3ff1d79d30000000 + .quad 0x3e4b50622a76b2fe + .quad 0x3ff1dbe620000000 + .quad 0x3e6fdb94ba595375 + .quad 0x3ff1e02cf0000000 + .quad 0x3e3861b9b945a171 + .quad 0x3ff1e471d0000000 + .quad 0x3e654348015188c4 + .quad 0x3ff1e8b490000000 + .quad 0x3e6b54d149865523 + .quad 0x3ff1ecf550000000 + .quad 0x3e6a0bb783d9de33 + .quad 0x3ff1f13410000000 + .quad 0x3e6629d12b1a2157 + .quad 0x3ff1f570d0000000 + .quad 0x3e6467fe35d179df + .quad 0x3ff1f9ab90000000 + .quad 0x3e69763f3e26c8f7 + .quad 0x3ff1fde450000000 + .quad 0x3e53f798bb9f7679 + .quad 0x3ff2021b20000000 + .quad 0x3e552e577e855898 + .quad 0x3ff2064ff0000000 + .quad 0x3e6fde47e5502c3a + .quad 0x3ff20a82c0000000 + .quad 0x3e5cbd0b548d96a0 + .quad 0x3ff20eb3b0000000 + .quad 0x3e6a9cd9f7be8de8 + .quad 0x3ff212e2a0000000 + .quad 0x3e522bbe704886de + .quad 0x3ff2170fb0000000 + .quad 0x3e6e3dea8317f020 + .quad 0x3ff21b3ac0000000 + .quad 0x3e6e812085ac8855 + .quad 0x3ff21f63f0000000 + .quad 0x3e5c87144f24cb07 + .quad 0x3ff2238b40000000 + .quad 0x3e61e128ee311fa2 + .quad 0x3ff227b0a0000000 + .quad 0x3e5b5c163d61a2d3 + .quad 0x3ff22bd420000000 + .quad 0x3e47d97e7fb90633 + .quad 0x3ff22ff5c0000000 + .quad 0x3e6efe899d50f6a7 + .quad 0x3ff2341570000000 + .quad 0x3e6d0333eb75de5a + .quad 0x3ff2383350000000 + .quad 0x3e40e590be73a573 + .quad 0x3ff23c4f60000000 + .quad 0x3e68ce8dcac3cdd2 + .quad 0x3ff2406980000000 + .quad 0x3e6ee8a48954064b + .quad 0x3ff24481d0000000 + .quad 0x3e6aa62f18461e09 + .quad 0x3ff2489850000000 + .quad 0x3e601e5940986a15 + .quad 0x3ff24cad00000000 + .quad 0x3e3b082f4f9b8d4c + .quad 0x3ff250bfe0000000 + .quad 0x3e6876e0e5527f5a + .quad 0x3ff254d0e0000000 + .quad 
0x3e63617080831e6b + .quad 0x3ff258e020000000 + .quad 0x3e681b26e34aa4a2 + .quad 0x3ff25ced90000000 + .quad 0x3e552ee66dfab0c1 + .quad 0x3ff260f940000000 + .quad 0x3e5d85a5329e8819 + .quad 0x3ff2650320000000 + .quad 0x3e5105c1b646b5d1 + .quad 0x3ff2690b40000000 + .quad 0x3e6bb6690c1a379c + .quad 0x3ff26d1190000000 + .quad 0x3e586aeba73ce3a9 + .quad 0x3ff2711630000000 + .quad 0x3e6dd16198294dd4 + .quad 0x3ff2751900000000 + .quad 0x3e6454e675775e83 + .quad 0x3ff2791a20000000 + .quad 0x3e63842e026197ea + .quad 0x3ff27d1980000000 + .quad 0x3e6f1ce0e70c44d2 + .quad 0x3ff2811720000000 + .quad 0x3e6ad636441a5627 + .quad 0x3ff2851310000000 + .quad 0x3e54c205d7212abb + .quad 0x3ff2890d50000000 + .quad 0x3e6167c86c116419 + .quad 0x3ff28d05d0000000 + .quad 0x3e638ec3ef16e294 + .quad 0x3ff290fca0000000 + .quad 0x3e6473fceace9321 + .quad 0x3ff294f1c0000000 + .quad 0x3e67af53a836dba7 + .quad 0x3ff298e530000000 + .quad 0x3e1a51f3c383b652 + .quad 0x3ff29cd700000000 + .quad 0x3e63696da190822d + .quad 0x3ff2a0c710000000 + .quad 0x3e62f9adec77074b + .quad 0x3ff2a4b580000000 + .quad 0x3e38190fd5bee55f + .quad 0x3ff2a8a250000000 + .quad 0x3e4bfee8fac68e55 + .quad 0x3ff2ac8d70000000 + .quad 0x3e331c9d6bc5f68a + .quad 0x3ff2b076f0000000 + .quad 0x3e689d0523737edf + .quad 0x3ff2b45ec0000000 + .quad 0x3e5a295943bf47bb + .quad 0x3ff2b84500000000 + .quad 0x3e396be32e5b3207 + .quad 0x3ff2bc29a0000000 + .quad 0x3e6e44c7d909fa0e + .quad 0x3ff2c00c90000000 + .quad 0x3e2b2505da94d9ea + .quad 0x3ff2c3ee00000000 + .quad 0x3e60c851f46c9c98 + .quad 0x3ff2c7cdc0000000 + .quad 0x3e5da71f7d9aa3b7 + .quad 0x3ff2cbabf0000000 + .quad 0x3e6f1b605d019ef1 + .quad 0x3ff2cf8880000000 + .quad 0x3e4386e8a2189563 + .quad 0x3ff2d36390000000 + .quad 0x3e3b19fa5d306ba7 + .quad 0x3ff2d73d00000000 + .quad 0x3e6dd749b67aef76 + .quad 0x3ff2db14d0000000 + .quad 0x3e676ff6f1dc04b0 + .quad 0x3ff2deeb20000000 + .quad 0x3e635a33d0b232a6 + .quad 0x3ff2e2bfe0000000 + .quad 0x3e64bdc80024a4e1 + .quad 0x3ff2e69310000000 + .quad 0x3e6ebd61770fd723 + .quad 0x3ff2ea64b0000000 + .quad 0x3e64769fc537264d + .quad 0x3ff2ee34d0000000 + .quad 0x3e69021f429f3b98 + .quad 0x3ff2f20360000000 + .quad 0x3e5ee7083efbd606 + .quad 0x3ff2f5d070000000 + .quad 0x3e6ad985552a6b1a + .quad 0x3ff2f99bf0000000 + .quad 0x3e6e3df778772160 + .quad 0x3ff2fd65f0000000 + .quad 0x3e6ca5d76ddc9b34 + .quad 0x3ff3012e70000000 + .quad 0x3e691154ffdbaf74 + .quad 0x3ff304f570000000 + .quad 0x3e667bdd57fb306a + .quad 0x3ff308baf0000000 + .quad 0x3e67dc255ac40886 + .quad 0x3ff30c7ef0000000 + .quad 0x3df219f38e8afafe + .quad 0x3ff3104180000000 + .quad 0x3e62416bf9669a04 + .quad 0x3ff3140280000000 + .quad 0x3e611c96b2b3987f + .quad 0x3ff317c210000000 + .quad 0x3e6f99ed447e1177 + .quad 0x3ff31b8020000000 + .quad 0x3e13245826328a11 + .quad 0x3ff31f3cd0000000 + .quad 0x3e66f56dd1e645f8 + .quad 0x3ff322f7f0000000 + .quad 0x3e46164946945535 + .quad 0x3ff326b1b0000000 + .quad 0x3e5e37d59d190028 + .quad 0x3ff32a69f0000000 + .quad 0x3e668671f12bf828 + .quad 0x3ff32e20c0000000 + .quad 0x3e6e8ecbca6aabbd + .quad 0x3ff331d620000000 + .quad 0x3e53f49e109a5912 + .quad 0x3ff3358a20000000 + .quad 0x3e6b8a0e11ec3043 + .quad 0x3ff3393ca0000000 + .quad 0x3e65fae00aed691a + .quad 0x3ff33cedc0000000 + .quad 0x3e6c0569bece3e4a + .quad 0x3ff3409d70000000 + .quad 0x3e605e26744efbfe + .quad 0x3ff3444bc0000000 + .quad 0x3e65b570a94be5c5 + .quad 0x3ff347f8a0000000 + .quad 0x3e5d6f156ea0e063 + .quad 0x3ff34ba420000000 + .quad 0x3e6e0ca7612fc484 + .quad 0x3ff34f4e30000000 + .quad 0x3e4963c927b25258 + .quad 
0x3ff352f6f0000000 + .quad 0x3e547930aa725a5c + .quad 0x3ff3569e40000000 + .quad 0x3e58a79fe3af43b3 + .quad 0x3ff35a4430000000 + .quad 0x3e5e6dc29c41bdaf + .quad 0x3ff35de8c0000000 + .quad 0x3e657a2e76f863a5 + .quad 0x3ff3618bf0000000 + .quad 0x3e2ae3b61716354d + .quad 0x3ff3652dd0000000 + .quad 0x3e665fb5df6906b1 + .quad 0x3ff368ce40000000 + .quad 0x3e66177d7f588f7b + .quad 0x3ff36c6d60000000 + .quad 0x3e3ad55abd091b67 + .quad 0x3ff3700b30000000 + .quad 0x3e155337b2422d76 + .quad 0x3ff373a7a0000000 + .quad 0x3e6084ebe86972d5 + .quad 0x3ff37742b0000000 + .quad 0x3e656395808e1ea3 + .quad 0x3ff37adc70000000 + .quad 0x3e61bce21b40fba7 + .quad 0x3ff37e74e0000000 + .quad 0x3e5006f94605b515 + .quad 0x3ff3820c00000000 + .quad 0x3e6aa676aceb1f7d + .quad 0x3ff385a1c0000000 + .quad 0x3e58229f76554ce6 + .quad 0x3ff3893640000000 + .quad 0x3e6eabfc6cf57330 + .quad 0x3ff38cc960000000 + .quad 0x3e64daed9c0ce8bc + .quad 0x3ff3905b40000000 + .quad 0x3e60ff1768237141 + .quad 0x3ff393ebd0000000 + .quad 0x3e6575f83051b085 + .quad 0x3ff3977b10000000 + .quad 0x3e42667deb523e29 + .quad 0x3ff39b0910000000 + .quad 0x3e1816996954f4fd + .quad 0x3ff39e95c0000000 + .quad 0x3e587cfccf4d9cd4 + .quad 0x3ff3a22120000000 + .quad 0x3e52c5d018198353 + .quad 0x3ff3a5ab40000000 + .quad 0x3e6a7a898dcc34aa + .quad 0x3ff3a93410000000 + .quad 0x3e2cead6dadc36d1 + .quad 0x3ff3acbbb0000000 + .quad 0x3e2a55759c498bdf + .quad 0x3ff3b04200000000 + .quad 0x3e6c414a9ef6de04 + .quad 0x3ff3b3c700000000 + .quad 0x3e63e2108a6e58fa + .quad 0x3ff3b74ad0000000 + .quad 0x3e5587fd7643d77c + .quad 0x3ff3bacd60000000 + .quad 0x3e3901eb1d3ff3df + .quad 0x3ff3be4eb0000000 + .quad 0x3e6f2ccd7c812fc6 + .quad 0x3ff3c1ceb0000000 + .quad 0x3e21c8ee70a01049 + .quad 0x3ff3c54d90000000 + .quad 0x3e563e8d02831eec + .quad 0x3ff3c8cb20000000 + .quad 0x3e6f61a42a92c7ff + .quad 0x3ff3cc4770000000 + .quad 0x3dda917399c84d24 + .quad 0x3ff3cfc2a0000000 + .quad 0x3e5e9197c8eec2f0 + .quad 0x3ff3d33c80000000 + .quad 0x3e5e6f842f5a1378 + .quad 0x3ff3d6b530000000 + .quad 0x3e2fac242a90a0fc + .quad 0x3ff3da2cb0000000 + .quad 0x3e535ed726610227 + .quad 0x3ff3dda2f0000000 + .quad 0x3e50e0d64804b15b + .quad 0x3ff3e11800000000 + .quad 0x3e0560675daba814 + .quad 0x3ff3e48be0000000 + .quad 0x3e637388c8768032 + .quad 0x3ff3e7fe80000000 + .quad 0x3e3ee3c89f9e01f5 + .quad 0x3ff3eb7000000000 + .quad 0x3e639f6f0d09747c + .quad 0x3ff3eee040000000 + .quad 0x3e4322c327abb8f0 + .quad 0x3ff3f24f60000000 + .quad 0x3e6961b347c8ac80 + .quad 0x3ff3f5bd40000000 + .quad 0x3e63711fbbd0f118 + .quad 0x3ff3f92a00000000 + .quad 0x3e64fad8d7718ffb + .quad 0x3ff3fc9590000000 + .quad 0x3e6fffffffffffff + .quad 0x3ff3fffff0000000 + .quad 0x3e667efa79ec35b4 + .quad 0x3ff4036930000000 + .quad 0x3e6a737687a254a8 + .quad 0x3ff406d140000000 + .quad 0x3e5bace0f87d924d + .quad 0x3ff40a3830000000 + .quad 0x3e629e37c237e392 + .quad 0x3ff40d9df0000000 + .quad 0x3e557ce7ac3f3012 + .quad 0x3ff4110290000000 + .quad 0x3e682829359f8fbd + .quad 0x3ff4146600000000 + .quad 0x3e6cc9be42d14676 + .quad 0x3ff417c850000000 + .quad 0x3e6a8f001c137d0b + .quad 0x3ff41b2980000000 + .quad 0x3e636127687dda05 + .quad 0x3ff41e8990000000 + .quad 0x3e524dba322646f0 + .quad 0x3ff421e880000000 + .quad 0x3e6dc43f1ed210b4 + .quad 0x3ff4254640000000 + .quad 0x3e631ae515c447bb + .quad 0x3ff428a2f0000000 + + +.align 32 +.L__CBRT_F_H_256: .quad 0x3ff0000000000000 + .quad 0x3ff0055380000000 + .quad 0x3ff00aa390000000 + .quad 0x3ff00ff010000000 + .quad 0x3ff0153920000000 + .quad 0x3ff01a7eb0000000 + .quad 0x3ff01fc0d0000000 + .quad 
0x3ff024ff80000000 + .quad 0x3ff02a3ad0000000 + .quad 0x3ff02f72b0000000 + .quad 0x3ff034a750000000 + .quad 0x3ff039d880000000 + .quad 0x3ff03f0670000000 + .quad 0x3ff0443110000000 + .quad 0x3ff0495870000000 + .quad 0x3ff04e7c80000000 + .quad 0x3ff0539d60000000 + .quad 0x3ff058bb00000000 + .quad 0x3ff05dd570000000 + .quad 0x3ff062ecc0000000 + .quad 0x3ff06800e0000000 + .quad 0x3ff06d11e0000000 + .quad 0x3ff0721fc0000000 + .quad 0x3ff0772a80000000 + .quad 0x3ff07c3230000000 + .quad 0x3ff08136d0000000 + .quad 0x3ff0863860000000 + .quad 0x3ff08b36f0000000 + .quad 0x3ff0903280000000 + .quad 0x3ff0952b10000000 + .quad 0x3ff09a20a0000000 + .quad 0x3ff09f1340000000 + .quad 0x3ff0a402f0000000 + .quad 0x3ff0a8efc0000000 + .quad 0x3ff0add990000000 + .quad 0x3ff0b2c090000000 + .quad 0x3ff0b7a4b0000000 + .quad 0x3ff0bc85f0000000 + .quad 0x3ff0c16450000000 + .quad 0x3ff0c63fe0000000 + .quad 0x3ff0cb18b0000000 + .quad 0x3ff0cfeeb0000000 + .quad 0x3ff0d4c1e0000000 + .quad 0x3ff0d99250000000 + .quad 0x3ff0de6010000000 + .quad 0x3ff0e32b00000000 + .quad 0x3ff0e7f340000000 + .quad 0x3ff0ecb8d0000000 + .quad 0x3ff0f17bb0000000 + .quad 0x3ff0f63bf0000000 + .quad 0x3ff0faf970000000 + .quad 0x3ff0ffb460000000 + .quad 0x3ff1046cb0000000 + .quad 0x3ff1092250000000 + .quad 0x3ff10dd560000000 + .quad 0x3ff11285e0000000 + .quad 0x3ff11733d0000000 + .quad 0x3ff11bdf30000000 + .quad 0x3ff1208800000000 + .quad 0x3ff1252e40000000 + .quad 0x3ff129d210000000 + .quad 0x3ff12e7350000000 + .quad 0x3ff1331210000000 + .quad 0x3ff137ae60000000 + .quad 0x3ff13c4840000000 + .quad 0x3ff140dfa0000000 + .quad 0x3ff1457490000000 + .quad 0x3ff14a0710000000 + .quad 0x3ff14e9730000000 + .quad 0x3ff15324e0000000 + .quad 0x3ff157b030000000 + .quad 0x3ff15c3920000000 + .quad 0x3ff160bfc0000000 + .quad 0x3ff16543f0000000 + .quad 0x3ff169c5d0000000 + .quad 0x3ff16e4560000000 + .quad 0x3ff172c2a0000000 + .quad 0x3ff1773d90000000 + .quad 0x3ff17bb630000000 + .quad 0x3ff1802c90000000 + .quad 0x3ff184a0a0000000 + .quad 0x3ff1891270000000 + .quad 0x3ff18d8210000000 + .quad 0x3ff191ef60000000 + .quad 0x3ff1965a80000000 + .quad 0x3ff19ac360000000 + .quad 0x3ff19f2a10000000 + .quad 0x3ff1a38e90000000 + .quad 0x3ff1a7f0e0000000 + .quad 0x3ff1ac5100000000 + .quad 0x3ff1b0af00000000 + .quad 0x3ff1b50ad0000000 + .quad 0x3ff1b96480000000 + .quad 0x3ff1bdbc10000000 + .quad 0x3ff1c21180000000 + .quad 0x3ff1c664d0000000 + .quad 0x3ff1cab610000000 + .quad 0x3ff1cf0530000000 + .quad 0x3ff1d35230000000 + .quad 0x3ff1d79d30000000 + .quad 0x3ff1dbe620000000 + .quad 0x3ff1e02cf0000000 + .quad 0x3ff1e471d0000000 + .quad 0x3ff1e8b490000000 + .quad 0x3ff1ecf550000000 + .quad 0x3ff1f13410000000 + .quad 0x3ff1f570d0000000 + .quad 0x3ff1f9ab90000000 + .quad 0x3ff1fde450000000 + .quad 0x3ff2021b20000000 + .quad 0x3ff2064ff0000000 + .quad 0x3ff20a82c0000000 + .quad 0x3ff20eb3b0000000 + .quad 0x3ff212e2a0000000 + .quad 0x3ff2170fb0000000 + .quad 0x3ff21b3ac0000000 + .quad 0x3ff21f63f0000000 + .quad 0x3ff2238b40000000 + .quad 0x3ff227b0a0000000 + .quad 0x3ff22bd420000000 + .quad 0x3ff22ff5c0000000 + .quad 0x3ff2341570000000 + .quad 0x3ff2383350000000 + .quad 0x3ff23c4f60000000 + .quad 0x3ff2406980000000 + .quad 0x3ff24481d0000000 + .quad 0x3ff2489850000000 + .quad 0x3ff24cad00000000 + .quad 0x3ff250bfe0000000 + .quad 0x3ff254d0e0000000 + .quad 0x3ff258e020000000 + .quad 0x3ff25ced90000000 + .quad 0x3ff260f940000000 + .quad 0x3ff2650320000000 + .quad 0x3ff2690b40000000 + .quad 0x3ff26d1190000000 + .quad 0x3ff2711630000000 + .quad 0x3ff2751900000000 + .quad 
0x3ff2791a20000000 + .quad 0x3ff27d1980000000 + .quad 0x3ff2811720000000 + .quad 0x3ff2851310000000 + .quad 0x3ff2890d50000000 + .quad 0x3ff28d05d0000000 + .quad 0x3ff290fca0000000 + .quad 0x3ff294f1c0000000 + .quad 0x3ff298e530000000 + .quad 0x3ff29cd700000000 + .quad 0x3ff2a0c710000000 + .quad 0x3ff2a4b580000000 + .quad 0x3ff2a8a250000000 + .quad 0x3ff2ac8d70000000 + .quad 0x3ff2b076f0000000 + .quad 0x3ff2b45ec0000000 + .quad 0x3ff2b84500000000 + .quad 0x3ff2bc29a0000000 + .quad 0x3ff2c00c90000000 + .quad 0x3ff2c3ee00000000 + .quad 0x3ff2c7cdc0000000 + .quad 0x3ff2cbabf0000000 + .quad 0x3ff2cf8880000000 + .quad 0x3ff2d36390000000 + .quad 0x3ff2d73d00000000 + .quad 0x3ff2db14d0000000 + .quad 0x3ff2deeb20000000 + .quad 0x3ff2e2bfe0000000 + .quad 0x3ff2e69310000000 + .quad 0x3ff2ea64b0000000 + .quad 0x3ff2ee34d0000000 + .quad 0x3ff2f20360000000 + .quad 0x3ff2f5d070000000 + .quad 0x3ff2f99bf0000000 + .quad 0x3ff2fd65f0000000 + .quad 0x3ff3012e70000000 + .quad 0x3ff304f570000000 + .quad 0x3ff308baf0000000 + .quad 0x3ff30c7ef0000000 + .quad 0x3ff3104180000000 + .quad 0x3ff3140280000000 + .quad 0x3ff317c210000000 + .quad 0x3ff31b8020000000 + .quad 0x3ff31f3cd0000000 + .quad 0x3ff322f7f0000000 + .quad 0x3ff326b1b0000000 + .quad 0x3ff32a69f0000000 + .quad 0x3ff32e20c0000000 + .quad 0x3ff331d620000000 + .quad 0x3ff3358a20000000 + .quad 0x3ff3393ca0000000 + .quad 0x3ff33cedc0000000 + .quad 0x3ff3409d70000000 + .quad 0x3ff3444bc0000000 + .quad 0x3ff347f8a0000000 + .quad 0x3ff34ba420000000 + .quad 0x3ff34f4e30000000 + .quad 0x3ff352f6f0000000 + .quad 0x3ff3569e40000000 + .quad 0x3ff35a4430000000 + .quad 0x3ff35de8c0000000 + .quad 0x3ff3618bf0000000 + .quad 0x3ff3652dd0000000 + .quad 0x3ff368ce40000000 + .quad 0x3ff36c6d60000000 + .quad 0x3ff3700b30000000 + .quad 0x3ff373a7a0000000 + .quad 0x3ff37742b0000000 + .quad 0x3ff37adc70000000 + .quad 0x3ff37e74e0000000 + .quad 0x3ff3820c00000000 + .quad 0x3ff385a1c0000000 + .quad 0x3ff3893640000000 + .quad 0x3ff38cc960000000 + .quad 0x3ff3905b40000000 + .quad 0x3ff393ebd0000000 + .quad 0x3ff3977b10000000 + .quad 0x3ff39b0910000000 + .quad 0x3ff39e95c0000000 + .quad 0x3ff3a22120000000 + .quad 0x3ff3a5ab40000000 + .quad 0x3ff3a93410000000 + .quad 0x3ff3acbbb0000000 + .quad 0x3ff3b04200000000 + .quad 0x3ff3b3c700000000 + .quad 0x3ff3b74ad0000000 + .quad 0x3ff3bacd60000000 + .quad 0x3ff3be4eb0000000 + .quad 0x3ff3c1ceb0000000 + .quad 0x3ff3c54d90000000 + .quad 0x3ff3c8cb20000000 + .quad 0x3ff3cc4770000000 + .quad 0x3ff3cfc2a0000000 + .quad 0x3ff3d33c80000000 + .quad 0x3ff3d6b530000000 + .quad 0x3ff3da2cb0000000 + .quad 0x3ff3dda2f0000000 + .quad 0x3ff3e11800000000 + .quad 0x3ff3e48be0000000 + .quad 0x3ff3e7fe80000000 + .quad 0x3ff3eb7000000000 + .quad 0x3ff3eee040000000 + .quad 0x3ff3f24f60000000 + .quad 0x3ff3f5bd40000000 + .quad 0x3ff3f92a00000000 + .quad 0x3ff3fc9590000000 + .quad 0x3ff3fffff0000000 + .quad 0x3ff4036930000000 + .quad 0x3ff406d140000000 + .quad 0x3ff40a3830000000 + .quad 0x3ff40d9df0000000 + .quad 0x3ff4110290000000 + .quad 0x3ff4146600000000 + .quad 0x3ff417c850000000 + .quad 0x3ff41b2980000000 + .quad 0x3ff41e8990000000 + .quad 0x3ff421e880000000 + .quad 0x3ff4254640000000 + +.align 32 +.L__CBRT_F_T_256: .quad 0x0000000000000000 + .quad 0x3e6e6a24c81e4294 + .quad 0x3e58548511e3a785 + .quad 0x3e64eb9336ec07f6 + .quad 0x3e40ea64b8b750e1 + .quad 0x3e461637cff8a53c + .quad 0x3e40733bf7bd1943 + .quad 0x3e5666911345cced + .quad 0x3e477b7a3f592f14 + .quad 0x3e6f18d3dd1a5402 + .quad 0x3e2be2f5a58ee9a4 + .quad 0x3e68901f8f085fa7 + .quad 
0x3e5c68b8cd5b5d69 + .quad 0x3e5a6b0e8624be42 + .quad 0x3dbc4b22b06f68e7 + .quad 0x3e60f3f0afcabe9b + .quad 0x3e548495bca4e1b7 + .quad 0x3e66107f1abdfdc3 + .quad 0x3e6e67261878288a + .quad 0x3e5a6bc155286f1e + .quad 0x3e58a759c64a85f2 + .quad 0x3e45fce70a4a8d09 + .quad 0x3e32f9cbf373fe1d + .quad 0x3e590564ce4ac359 + .quad 0x3e5ac29ce761b02f + .quad 0x3e5cb752f497381c + .quad 0x3e68bb9e1cfb35e0 + .quad 0x3e65b4917099de90 + .quad 0x3e5cc77ac9c65ef2 + .quad 0x3e57a0f3e7be3dba + .quad 0x3e66ec851ee0c16f + .quad 0x3e689449bf2946da + .quad 0x3e698f25301ba223 + .quad 0x3e347d5ec651f549 + .quad 0x3e6c33ec9a86007a + .quad 0x3e5e0b6653e92649 + .quad 0x3e3bd64ac09d755f + .quad 0x3e2f537506f78167 + .quad 0x3e62c382d1b3735e + .quad 0x3e6e20ed659f99e1 + .quad 0x3e586b633a9c182a + .quad 0x3e445cfd5a65e777 + .quad 0x3e60c8770f58bca4 + .quad 0x3e6739e44b0933c5 + .quad 0x3e027dc3d9ce7bd8 + .quad 0x3e63c53c7c5a7b64 + .quad 0x3e69669683830cec + .quad 0x3e68d772c39bdcc4 + .quad 0x3e69b0008bcf6d7b + .quad 0x3e3bbb305825ce4f + .quad 0x3e6da3f4af13a406 + .quad 0x3e5f36b96f74ce86 + .quad 0x3e165c002303f790 + .quad 0x3e682f84095ba7d5 + .quad 0x3e6d46433541b2c6 + .quad 0x3e671c3d56e93a89 + .quad 0x3e598dcef4e40012 + .quad 0x3e4530ebef17fe03 + .quad 0x3e4e8b8fa3715066 + .quad 0x3e6ab26eb3b211dc + .quad 0x3e454dd4dc906307 + .quad 0x3e5c9f962387984e + .quad 0x3e6c62a959afec09 + .quad 0x3e6638d9ac6a866a + .quad 0x3e338704eca8a22d + .quad 0x3e4e6c9e1db14f8f + .quad 0x3e58744b7f9c9eaa + .quad 0x3e66c2893486373b + .quad 0x3e5b36bce31699b7 + .quad 0x3e671e3813d200c7 + .quad 0x3e699755ab40aa88 + .quad 0x3e6b45ca0e4bcfc0 + .quad 0x3e32dd090d869c5d + .quad 0x3e64fe0516b917da + .quad 0x3e694563226317a2 + .quad 0x3e653d8fafc2c851 + .quad 0x3e5dcbd41fbd41a3 + .quad 0x3e5862ff5285f59c + .quad 0x3e63072ea97a1e1c + .quad 0x3e52839075184805 + .quad 0x3e64b0323e9eff42 + .quad 0x3e6b158893c45484 + .quad 0x3e3149ef0fc35826 + .quad 0x3e5f2e77ea96acaa + .quad 0x3e5200074c471a95 + .quad 0x3e63f8cc517f6f04 + .quad 0x3e660ba2e311bb55 + .quad 0x3e64b788730bbec3 + .quad 0x3e657090795ee20c + .quad 0x3e6d9ffe983670b1 + .quad 0x3e62a463ff61bfda + .quad 0x3e69d1bc6a5e65cf + .quad 0x3e68718abaa9e922 + .quad 0x3e63c2f52ffa342e + .quad 0x3e60fae13ff42c80 + .quad 0x3e65440f0ef00d57 + .quad 0x3e46fcd22d4e3c1e + .quad 0x3e4e0c60b409e863 + .quad 0x3e6f9cab5a5f0333 + .quad 0x3e630f24744c333d + .quad 0x3e4b50622a76b2fe + .quad 0x3e6fdb94ba595375 + .quad 0x3e3861b9b945a171 + .quad 0x3e654348015188c4 + .quad 0x3e6b54d149865523 + .quad 0x3e6a0bb783d9de33 + .quad 0x3e6629d12b1a2157 + .quad 0x3e6467fe35d179df + .quad 0x3e69763f3e26c8f7 + .quad 0x3e53f798bb9f7679 + .quad 0x3e552e577e855898 + .quad 0x3e6fde47e5502c3a + .quad 0x3e5cbd0b548d96a0 + .quad 0x3e6a9cd9f7be8de8 + .quad 0x3e522bbe704886de + .quad 0x3e6e3dea8317f020 + .quad 0x3e6e812085ac8855 + .quad 0x3e5c87144f24cb07 + .quad 0x3e61e128ee311fa2 + .quad 0x3e5b5c163d61a2d3 + .quad 0x3e47d97e7fb90633 + .quad 0x3e6efe899d50f6a7 + .quad 0x3e6d0333eb75de5a + .quad 0x3e40e590be73a573 + .quad 0x3e68ce8dcac3cdd2 + .quad 0x3e6ee8a48954064b + .quad 0x3e6aa62f18461e09 + .quad 0x3e601e5940986a15 + .quad 0x3e3b082f4f9b8d4c + .quad 0x3e6876e0e5527f5a + .quad 0x3e63617080831e6b + .quad 0x3e681b26e34aa4a2 + .quad 0x3e552ee66dfab0c1 + .quad 0x3e5d85a5329e8819 + .quad 0x3e5105c1b646b5d1 + .quad 0x3e6bb6690c1a379c + .quad 0x3e586aeba73ce3a9 + .quad 0x3e6dd16198294dd4 + .quad 0x3e6454e675775e83 + .quad 0x3e63842e026197ea + .quad 0x3e6f1ce0e70c44d2 + .quad 0x3e6ad636441a5627 + .quad 0x3e54c205d7212abb + .quad 
0x3e6167c86c116419 + .quad 0x3e638ec3ef16e294 + .quad 0x3e6473fceace9321 + .quad 0x3e67af53a836dba7 + .quad 0x3e1a51f3c383b652 + .quad 0x3e63696da190822d + .quad 0x3e62f9adec77074b + .quad 0x3e38190fd5bee55f + .quad 0x3e4bfee8fac68e55 + .quad 0x3e331c9d6bc5f68a + .quad 0x3e689d0523737edf + .quad 0x3e5a295943bf47bb + .quad 0x3e396be32e5b3207 + .quad 0x3e6e44c7d909fa0e + .quad 0x3e2b2505da94d9ea + .quad 0x3e60c851f46c9c98 + .quad 0x3e5da71f7d9aa3b7 + .quad 0x3e6f1b605d019ef1 + .quad 0x3e4386e8a2189563 + .quad 0x3e3b19fa5d306ba7 + .quad 0x3e6dd749b67aef76 + .quad 0x3e676ff6f1dc04b0 + .quad 0x3e635a33d0b232a6 + .quad 0x3e64bdc80024a4e1 + .quad 0x3e6ebd61770fd723 + .quad 0x3e64769fc537264d + .quad 0x3e69021f429f3b98 + .quad 0x3e5ee7083efbd606 + .quad 0x3e6ad985552a6b1a + .quad 0x3e6e3df778772160 + .quad 0x3e6ca5d76ddc9b34 + .quad 0x3e691154ffdbaf74 + .quad 0x3e667bdd57fb306a + .quad 0x3e67dc255ac40886 + .quad 0x3df219f38e8afafe + .quad 0x3e62416bf9669a04 + .quad 0x3e611c96b2b3987f + .quad 0x3e6f99ed447e1177 + .quad 0x3e13245826328a11 + .quad 0x3e66f56dd1e645f8 + .quad 0x3e46164946945535 + .quad 0x3e5e37d59d190028 + .quad 0x3e668671f12bf828 + .quad 0x3e6e8ecbca6aabbd + .quad 0x3e53f49e109a5912 + .quad 0x3e6b8a0e11ec3043 + .quad 0x3e65fae00aed691a + .quad 0x3e6c0569bece3e4a + .quad 0x3e605e26744efbfe + .quad 0x3e65b570a94be5c5 + .quad 0x3e5d6f156ea0e063 + .quad 0x3e6e0ca7612fc484 + .quad 0x3e4963c927b25258 + .quad 0x3e547930aa725a5c + .quad 0x3e58a79fe3af43b3 + .quad 0x3e5e6dc29c41bdaf + .quad 0x3e657a2e76f863a5 + .quad 0x3e2ae3b61716354d + .quad 0x3e665fb5df6906b1 + .quad 0x3e66177d7f588f7b + .quad 0x3e3ad55abd091b67 + .quad 0x3e155337b2422d76 + .quad 0x3e6084ebe86972d5 + .quad 0x3e656395808e1ea3 + .quad 0x3e61bce21b40fba7 + .quad 0x3e5006f94605b515 + .quad 0x3e6aa676aceb1f7d + .quad 0x3e58229f76554ce6 + .quad 0x3e6eabfc6cf57330 + .quad 0x3e64daed9c0ce8bc + .quad 0x3e60ff1768237141 + .quad 0x3e6575f83051b085 + .quad 0x3e42667deb523e29 + .quad 0x3e1816996954f4fd + .quad 0x3e587cfccf4d9cd4 + .quad 0x3e52c5d018198353 + .quad 0x3e6a7a898dcc34aa + .quad 0x3e2cead6dadc36d1 + .quad 0x3e2a55759c498bdf + .quad 0x3e6c414a9ef6de04 + .quad 0x3e63e2108a6e58fa + .quad 0x3e5587fd7643d77c + .quad 0x3e3901eb1d3ff3df + .quad 0x3e6f2ccd7c812fc6 + .quad 0x3e21c8ee70a01049 + .quad 0x3e563e8d02831eec + .quad 0x3e6f61a42a92c7ff + .quad 0x3dda917399c84d24 + .quad 0x3e5e9197c8eec2f0 + .quad 0x3e5e6f842f5a1378 + .quad 0x3e2fac242a90a0fc + .quad 0x3e535ed726610227 + .quad 0x3e50e0d64804b15b + .quad 0x3e0560675daba814 + .quad 0x3e637388c8768032 + .quad 0x3e3ee3c89f9e01f5 + .quad 0x3e639f6f0d09747c + .quad 0x3e4322c327abb8f0 + .quad 0x3e6961b347c8ac80 + .quad 0x3e63711fbbd0f118 + .quad 0x3e64fad8d7718ffb + .quad 0x3e6fffffffffffff + .quad 0x3e667efa79ec35b4 + .quad 0x3e6a737687a254a8 + .quad 0x3e5bace0f87d924d + .quad 0x3e629e37c237e392 + .quad 0x3e557ce7ac3f3012 + .quad 0x3e682829359f8fbd + .quad 0x3e6cc9be42d14676 + .quad 0x3e6a8f001c137d0b + .quad 0x3e636127687dda05 + .quad 0x3e524dba322646f0 + .quad 0x3e6dc43f1ed210b4 + +.align 32 +.L__INV_TAB_256: .quad 0x4000000000000000 + .quad 0x3fffe01fe01fe020 + .quad 0x3fffc07f01fc07f0 + .quad 0x3fffa11caa01fa12 + .quad 0x3fff81f81f81f820 + .quad 0x3fff6310aca0dbb5 + .quad 0x3fff44659e4a4271 + .quad 0x3fff25f644230ab5 + .quad 0x3fff07c1f07c1f08 + .quad 0x3ffee9c7f8458e02 + .quad 0x3ffecc07b301ecc0 + .quad 0x3ffeae807aba01eb + .quad 0x3ffe9131abf0b767 + .quad 0x3ffe741aa59750e4 + .quad 0x3ffe573ac901e574 + .quad 0x3ffe3a9179dc1a73 + .quad 0x3ffe1e1e1e1e1e1e + .quad 
0x3ffe01e01e01e01e + .quad 0x3ffde5d6e3f8868a + .quad 0x3ffdca01dca01dca + .quad 0x3ffdae6076b981db + .quad 0x3ffd92f2231e7f8a + .quad 0x3ffd77b654b82c34 + .quad 0x3ffd5cac807572b2 + .quad 0x3ffd41d41d41d41d + .quad 0x3ffd272ca3fc5b1a + .quad 0x3ffd0cb58f6ec074 + .quad 0x3ffcf26e5c44bfc6 + .quad 0x3ffcd85689039b0b + .quad 0x3ffcbe6d9601cbe7 + .quad 0x3ffca4b3055ee191 + .quad 0x3ffc8b265afb8a42 + .quad 0x3ffc71c71c71c71c + .quad 0x3ffc5894d10d4986 + .quad 0x3ffc3f8f01c3f8f0 + .quad 0x3ffc26b5392ea01c + .quad 0x3ffc0e070381c0e0 + .quad 0x3ffbf583ee868d8b + .quad 0x3ffbdd2b899406f7 + .quad 0x3ffbc4fd65883e7b + .quad 0x3ffbacf914c1bad0 + .quad 0x3ffb951e2b18ff23 + .quad 0x3ffb7d6c3dda338b + .quad 0x3ffb65e2e3beee05 + .quad 0x3ffb4e81b4e81b4f + .quad 0x3ffb37484ad806ce + .quad 0x3ffb2036406c80d9 + .quad 0x3ffb094b31d922a4 + .quad 0x3ffaf286bca1af28 + .quad 0x3ffadbe87f94905e + .quad 0x3ffac5701ac5701b + .quad 0x3ffaaf1d2f87ebfd + .quad 0x3ffa98ef606a63be + .quad 0x3ffa82e65130e159 + .quad 0x3ffa6d01a6d01a6d + .quad 0x3ffa574107688a4a + .quad 0x3ffa41a41a41a41a + .quad 0x3ffa2c2a87c51ca0 + .quad 0x3ffa16d3f97a4b02 + .quad 0x3ffa01a01a01a01a + .quad 0x3ff9ec8e951033d9 + .quad 0x3ff9d79f176b682d + .quad 0x3ff9c2d14ee4a102 + .quad 0x3ff9ae24ea5510da + .quad 0x3ff999999999999a + .quad 0x3ff9852f0d8ec0ff + .quad 0x3ff970e4f80cb872 + .quad 0x3ff95cbb0be377ae + .quad 0x3ff948b0fcd6e9e0 + .quad 0x3ff934c67f9b2ce6 + .quad 0x3ff920fb49d0e229 + .quad 0x3ff90d4f120190d5 + .quad 0x3ff8f9c18f9c18fa + .quad 0x3ff8e6527af1373f + .quad 0x3ff8d3018d3018d3 + .quad 0x3ff8bfce8062ff3a + .quad 0x3ff8acb90f6bf3aa + .quad 0x3ff899c0f601899c + .quad 0x3ff886e5f0abb04a + .quad 0x3ff87427bcc092b9 + .quad 0x3ff8618618618618 + .quad 0x3ff84f00c2780614 + .quad 0x3ff83c977ab2bedd + .quad 0x3ff82a4a0182a4a0 + .quad 0x3ff8181818181818 + .quad 0x3ff8060180601806 + .quad 0x3ff7f405fd017f40 + .quad 0x3ff7e225515a4f1d + .quad 0x3ff7d05f417d05f4 + .quad 0x3ff7beb3922e017c + .quad 0x3ff7ad2208e0ecc3 + .quad 0x3ff79baa6bb6398b + .quad 0x3ff78a4c8178a4c8 + .quad 0x3ff77908119ac60d + .quad 0x3ff767dce434a9b1 + .quad 0x3ff756cac201756d + .quad 0x3ff745d1745d1746 + .quad 0x3ff734f0c541fe8d + .quad 0x3ff724287f46debc + .quad 0x3ff713786d9c7c09 + .quad 0x3ff702e05c0b8170 + .quad 0x3ff6f26016f26017 + .quad 0x3ff6e1f76b4337c7 + .quad 0x3ff6d1a62681c861 + .quad 0x3ff6c16c16c16c17 + .quad 0x3ff6b1490aa31a3d + .quad 0x3ff6a13cd1537290 + .quad 0x3ff691473a88d0c0 + .quad 0x3ff6816816816817 + .quad 0x3ff6719f3601671a + .quad 0x3ff661ec6a5122f9 + .quad 0x3ff6524f853b4aa3 + .quad 0x3ff642c8590b2164 + .quad 0x3ff63356b88ac0de + .quad 0x3ff623fa77016240 + .quad 0x3ff614b36831ae94 + .quad 0x3ff6058160581606 + .quad 0x3ff5f66434292dfc + .quad 0x3ff5e75bb8d015e7 + .quad 0x3ff5d867c3ece2a5 + .quad 0x3ff5c9882b931057 + .quad 0x3ff5babcc647fa91 + .quad 0x3ff5ac056b015ac0 + .quad 0x3ff59d61f123ccaa + .quad 0x3ff58ed2308158ed + .quad 0x3ff5805601580560 + .quad 0x3ff571ed3c506b3a + .quad 0x3ff56397ba7c52e2 + .quad 0x3ff5555555555555 + .quad 0x3ff54725e6bb82fe + .quad 0x3ff5390948f40feb + .quad 0x3ff52aff56a8054b + .quad 0x3ff51d07eae2f815 + .quad 0x3ff50f22e111c4c5 + .quad 0x3ff5015015015015 + .quad 0x3ff4f38f62dd4c9b + .quad 0x3ff4e5e0a72f0539 + .quad 0x3ff4d843bedc2c4c + .quad 0x3ff4cab88725af6e + .quad 0x3ff4bd3edda68fe1 + .quad 0x3ff4afd6a052bf5b + .quad 0x3ff4a27fad76014a + .quad 0x3ff49539e3b2d067 + .quad 0x3ff4880522014880 + .quad 0x3ff47ae147ae147b + .quad 0x3ff46dce34596066 + .quad 0x3ff460cbc7f5cf9a + .quad 0x3ff453d9e2c776ca + .quad 
0x3ff446f86562d9fb + .quad 0x3ff43a2730abee4d + .quad 0x3ff42d6625d51f87 + .quad 0x3ff420b5265e5951 + .quad 0x3ff4141414141414 + .quad 0x3ff40782d10e6566 + .quad 0x3ff3fb013fb013fb + .quad 0x3ff3ee8f42a5af07 + .quad 0x3ff3e22cbce4a902 + .quad 0x3ff3d5d991aa75c6 + .quad 0x3ff3c995a47babe7 + .quad 0x3ff3bd60d9232955 + .quad 0x3ff3b13b13b13b14 + .quad 0x3ff3a524387ac822 + .quad 0x3ff3991c2c187f63 + .quad 0x3ff38d22d366088e + .quad 0x3ff3813813813814 + .quad 0x3ff3755bd1c945ee + .quad 0x3ff3698df3de0748 + .quad 0x3ff35dce5f9f2af8 + .quad 0x3ff3521cfb2b78c1 + .quad 0x3ff34679ace01346 + .quad 0x3ff33ae45b57bcb2 + .quad 0x3ff32f5ced6a1dfa + .quad 0x3ff323e34a2b10bf + .quad 0x3ff3187758e9ebb6 + .quad 0x3ff30d190130d190 + .quad 0x3ff301c82ac40260 + .quad 0x3ff2f684bda12f68 + .quad 0x3ff2eb4ea1fed14b + .quad 0x3ff2e025c04b8097 + .quad 0x3ff2d50a012d50a0 + .quad 0x3ff2c9fb4d812ca0 + .quad 0x3ff2bef98e5a3711 + .quad 0x3ff2b404ad012b40 + .quad 0x3ff2a91c92f3c105 + .quad 0x3ff29e4129e4129e + .quad 0x3ff293725bb804a5 + .quad 0x3ff288b01288b013 + .quad 0x3ff27dfa38a1ce4d + .quad 0x3ff27350b8812735 + .quad 0x3ff268b37cd60127 + .quad 0x3ff25e22708092f1 + .quad 0x3ff2539d7e9177b2 + .quad 0x3ff2492492492492 + .quad 0x3ff23eb79717605b + .quad 0x3ff23456789abcdf + .quad 0x3ff22a0122a0122a + .quad 0x3ff21fb78121fb78 + .quad 0x3ff21579804855e6 + .quad 0x3ff20b470c67c0d9 + .quad 0x3ff2012012012012 + .quad 0x3ff1f7047dc11f70 + .quad 0x3ff1ecf43c7fb84c + .quad 0x3ff1e2ef3b3fb874 + .quad 0x3ff1d8f5672e4abd + .quad 0x3ff1cf06ada2811d + .quad 0x3ff1c522fc1ce059 + .quad 0x3ff1bb4a4046ed29 + .quad 0x3ff1b17c67f2bae3 + .quad 0x3ff1a7b9611a7b96 + .quad 0x3ff19e0119e0119e + .quad 0x3ff19453808ca29c + .quad 0x3ff18ab083902bdb + .quad 0x3ff1811811811812 + .quad 0x3ff1778a191bd684 + .quad 0x3ff16e0689427379 + .quad 0x3ff1648d50fc3201 + .quad 0x3ff15b1e5f75270d + .quad 0x3ff151b9a3fdd5c9 + .quad 0x3ff1485f0e0acd3b + .quad 0x3ff13f0e8d344724 + .quad 0x3ff135c81135c811 + .quad 0x3ff12c8b89edc0ac + .quad 0x3ff12358e75d3033 + .quad 0x3ff11a3019a74826 + .quad 0x3ff1111111111111 + .quad 0x3ff107fbbe011080 + .quad 0x3ff0fef010fef011 + .quad 0x3ff0f5edfab325a2 + .quad 0x3ff0ecf56be69c90 + .quad 0x3ff0e40655826011 + .quad 0x3ff0db20a88f4696 + .quad 0x3ff0d24456359e3a + .quad 0x3ff0c9714fbcda3b + .quad 0x3ff0c0a7868b4171 + .quad 0x3ff0b7e6ec259dc8 + .quad 0x3ff0af2f722eecb5 + .quad 0x3ff0a6810a6810a7 + .quad 0x3ff09ddba6af8360 + .quad 0x3ff0953f39010954 + .quad 0x3ff08cabb37565e2 + .quad 0x3ff0842108421084 + .quad 0x3ff07b9f29b8eae2 + .quad 0x3ff073260a47f7c6 + .quad 0x3ff06ab59c7912fb + .quad 0x3ff0624dd2f1a9fc + .quad 0x3ff059eea0727586 + .quad 0x3ff05197f7d73404 + .quad 0x3ff04949cc1664c5 + .quad 0x3ff0410410410410 + .quad 0x3ff038c6b78247fc + .quad 0x3ff03091b51f5e1a + .quad 0x3ff02864fc7729e9 + .quad 0x3ff0204081020408 + .quad 0x3ff0182436517a37 + .quad 0x3ff0101010101010 + .quad 0x3ff0080402010080 + .quad 0x3ff0000000000000 +
diff --git a/src/gas/cbrtf.S b/src/gas/cbrtf.S new file mode 100644 index 0000000..21bdd0b --- /dev/null +++ b/src/gas/cbrtf.S
@@ -0,0 +1,717 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# cbrtf.S
+#
+# An implementation of the cbrtf libm function.
+#
+# Prototype:
+#
+#     float cbrtf(float x);
+#
+
+#
+# Algorithm:
+#
+# Split x = 2^e * m with m in [1,2), and e = 3*q + d with d in [-2,2].
+# Index a 256-entry reciprocal table with the top 8 mantissa bits, form
+# r = m*recip - 1, approximate cbrt(1+r) by a short polynomial, and
+# combine the result with table entries for cbrt(1/recip) and cbrt(2^d)
+# and with the scale factor 2^q (a C sketch follows this file's diff).
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cbrtf)
+#define fname_special _cbrtf_special
+
+
+# local variable storage offsets
+
+.equ store_input, 0x0
+.equ stack_size, 0x20
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 32
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ xor %rcx,%rcx
+ sub $stack_size, %rsp
+ movss %xmm0, store_input(%rsp)
+ movss %xmm0,%xmm1
+ mov store_input(%rsp),%r8
+ mov $0x7F800000,%r10
+ mov $0x007FFFFF,%r11
+ mov %r8,%r9
+ and %r10,%r8 # r8 = stores the exponent
+ and %r11,%r9 # r9 = stores the mantissa
+ cmp $0X7F800000,%r8
+ jz .L__cbrtf_is_nan_infinite
+ cmp $0X0,%r8
+ jz .L__cbrtf_is_denormal
+.align 32
+.L__cbrtf_is_normal:
+ cvtps2pd %xmm1,%xmm1
+ shr $23,%r8 # exp value
+ mov $3,%rdx # divisor; the exponent is divided by 3
+ mov %r8,%rax
+ movsd %xmm1,%xmm6
+ shr $15,%r9 # index for the reciprocal
+ sub $0x7F,%ax
+ idiv %dl # the accumulator ax is divided by dl=3
+ mov %ax,%dx
+ shr $8,%dx # dx contains the remainder
+ add $2,%dl
+ # ax contains the quotient, the scale factor
+ cbw # sign extend al to ax
+ add $0x3FF,%ax
+ shl $52,%rax
+ pand .L__mantissa_mask_64(%rip),%xmm1
+ mov %rax,store_input(%rsp)
+ movsd store_input(%rsp),%xmm7
+ movsd .L__sign_mask_64(%rip),%xmm2
+ por .L__one_mask_64(%rip),%xmm1
+ movapd .L__coefficients(%rip),%xmm0
+ pandn %xmm1,%xmm2
+ pand .L__sign_mask_64(%rip),%xmm6 # has the sign
+ lea .L__DoubleReciprocalTable_256(%rip),%r8
+ lea .L__CubeRootTable_256(%rip),%rax
+ movsd (%r8,%r9,8),%xmm3 # reciprocal; each double is 8 bytes
+ movsd (%rax,%r9,8),%xmm4 # cube root
+ mulsd %xmm2,%xmm3
+ subsd .L__one_mask_64(%rip),%xmm3
+
+ # movddup %xmm3,%xmm3
+ shufpd $0,%xmm3,%xmm3 # replacing movddup
+
+ mulsd %xmm3,%xmm3
+ mulpd %xmm3,%xmm0
+#######################################################################
+# haddpd is an SSE3 instruction; using it gives better performance:
+ #haddpd %xmm0,%xmm0
+# The following three instructions replace haddpd; comment them out and
+# uncomment the haddpd above if SSE3 instructions can be used.
+ movapd %xmm0,%xmm3
+ unpckhpd %xmm3,%xmm3
+ addsd %xmm3,%xmm0
+#######################################################################
+ addsd .L__one_mask_64(%rip),%xmm0
+ mulsd %xmm7,%xmm0
+ lea .L__defined_cuberoot(%rip),%rax
+ mulsd (%rax,%rdx,8),%xmm0
+
+ mulsd %xmm4,%xmm0
+ cmp $1,%cx
+ jnz .L__final_result
+ mulsd .L__denormal_factor(%rip),%xmm0
+
+.align 32
+.L__final_result:
+ por %xmm6, %xmm0
+ cvtsd2ss %xmm0,%xmm0
+ add $stack_size, %rsp
+ ret
+
+
+.align 32
+.L__cbrtf_is_denormal:
+ cmp $0,%r9
+ jz .L__cbrtf_is_zero
+ mulss
.L__2_pow_23(%rip),%xmm1 + movss %xmm1, store_input(%rsp) + mov $1,%cx + mov store_input(%rsp),%r8 + mov %r8,%r9 + and %r10,%r8 # r8 = stores the exponent + and %r11,%r9 # r9 = stores the mantissa + jmp .L__cbrtf_is_normal + +.align 32 +.L__cbrtf_is_nan_infinite: + cmp $0,%r9 + jz .L__cbrtf_is_infinite + mulss %xmm0,%xmm0 #this multiplication will raise an invalid exception + por .L__qnan_mask_32(%rip),%xmm0 + +.L__cbrtf_is_infinite: +.L__cbrtf_is_one: +.L__cbrtf_is_zero: + add $stack_size, %rsp + ret + +.align 32 +.L__mantissa_mask_32: .long 0x007FFFFF + .long 0 #this zero is necessary +.align 16 +.L__qnan_mask_32: .long 0x00400000 + .long 0 +.L__exp_mask_32: .long 0x7F800000 + .long 0 +.L__zero: .long 0x00000000 + .long 0 +.align 16 +.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF +.L__2_pow_23: .long 0x4B000000 + + +.align 16 +.L__sign_mask_64: .quad 0x8000000000000000 + .quad 0 +.L__one_mask_64: .quad 0x3FF0000000000000 + .quad 0 + +.align 16 +.L__denormal_factor: .quad 0x3F7428A2F98D728B + .quad 0 +.align 16 +.L__coefficients: + .quad 0xbFBC71C71C71C71C + .quad 0x3fd5555555555555 +.align 16 +.L__defined_cuberoot: .quad 0x3FE428A2F98D728B + .quad 0x3FE965FEA53D6E3D + .quad 0x3FF0000000000000 + .quad 0x3FF428A2F98D728B + .quad 0x3FF965FEA53D6E3D + +.align 32 +.L__DoubleReciprocalTable_256: .quad 0X3ff0000000000000 + .quad 0X3fefe00000000000 + .quad 0X3fefc00000000000 + .quad 0X3fefa00000000000 + .quad 0X3fef800000000000 + .quad 0X3fef600000000000 + .quad 0X3fef400000000000 + .quad 0X3fef200000000000 + .quad 0X3fef000000000000 + .quad 0X3feee00000000000 + .quad 0X3feec00000000000 + .quad 0X3feea00000000000 + .quad 0X3fee900000000000 + .quad 0X3fee700000000000 + .quad 0X3fee500000000000 + .quad 0X3fee300000000000 + .quad 0X3fee100000000000 + .quad 0X3fee000000000000 + .quad 0X3fede00000000000 + .quad 0X3fedc00000000000 + .quad 0X3feda00000000000 + .quad 0X3fed900000000000 + .quad 0X3fed700000000000 + .quad 0X3fed500000000000 + .quad 0X3fed400000000000 + .quad 0X3fed200000000000 + .quad 0X3fed000000000000 + .quad 0X3fecf00000000000 + .quad 0X3fecd00000000000 + .quad 0X3fecb00000000000 + .quad 0X3feca00000000000 + .quad 0X3fec800000000000 + .quad 0X3fec700000000000 + .quad 0X3fec500000000000 + .quad 0X3fec300000000000 + .quad 0X3fec200000000000 + .quad 0X3fec000000000000 + .quad 0X3febf00000000000 + .quad 0X3febd00000000000 + .quad 0X3febc00000000000 + .quad 0X3feba00000000000 + .quad 0X3feb900000000000 + .quad 0X3feb700000000000 + .quad 0X3feb600000000000 + .quad 0X3feb400000000000 + .quad 0X3feb300000000000 + .quad 0X3feb200000000000 + .quad 0X3feb000000000000 + .quad 0X3feaf00000000000 + .quad 0X3fead00000000000 + .quad 0X3feac00000000000 + .quad 0X3feaa00000000000 + .quad 0X3fea900000000000 + .quad 0X3fea800000000000 + .quad 0X3fea600000000000 + .quad 0X3fea500000000000 + .quad 0X3fea400000000000 + .quad 0X3fea200000000000 + .quad 0X3fea100000000000 + .quad 0X3fea000000000000 + .quad 0X3fe9e00000000000 + .quad 0X3fe9d00000000000 + .quad 0X3fe9c00000000000 + .quad 0X3fe9a00000000000 + .quad 0X3fe9900000000000 + .quad 0X3fe9800000000000 + .quad 0X3fe9700000000000 + .quad 0X3fe9500000000000 + .quad 0X3fe9400000000000 + .quad 0X3fe9300000000000 + .quad 0X3fe9200000000000 + .quad 0X3fe9000000000000 + .quad 0X3fe8f00000000000 + .quad 0X3fe8e00000000000 + .quad 0X3fe8d00000000000 + .quad 0X3fe8b00000000000 + .quad 0X3fe8a00000000000 + .quad 0X3fe8900000000000 + .quad 0X3fe8800000000000 + .quad 0X3fe8700000000000 + .quad 0X3fe8600000000000 + .quad 0X3fe8400000000000 + .quad 0X3fe8300000000000 + 
.quad 0X3fe8200000000000 + .quad 0X3fe8100000000000 + .quad 0X3fe8000000000000 + .quad 0X3fe7f00000000000 + .quad 0X3fe7e00000000000 + .quad 0X3fe7d00000000000 + .quad 0X3fe7b00000000000 + .quad 0X3fe7a00000000000 + .quad 0X3fe7900000000000 + .quad 0X3fe7800000000000 + .quad 0X3fe7700000000000 + .quad 0X3fe7600000000000 + .quad 0X3fe7500000000000 + .quad 0X3fe7400000000000 + .quad 0X3fe7300000000000 + .quad 0X3fe7200000000000 + .quad 0X3fe7100000000000 + .quad 0X3fe7000000000000 + .quad 0X3fe6f00000000000 + .quad 0X3fe6e00000000000 + .quad 0X3fe6d00000000000 + .quad 0X3fe6c00000000000 + .quad 0X3fe6b00000000000 + .quad 0X3fe6a00000000000 + .quad 0X3fe6900000000000 + .quad 0X3fe6800000000000 + .quad 0X3fe6700000000000 + .quad 0X3fe6600000000000 + .quad 0X3fe6500000000000 + .quad 0X3fe6400000000000 + .quad 0X3fe6300000000000 + .quad 0X3fe6200000000000 + .quad 0X3fe6100000000000 + .quad 0X3fe6000000000000 + .quad 0X3fe5f00000000000 + .quad 0X3fe5e00000000000 + .quad 0X3fe5d00000000000 + .quad 0X3fe5c00000000000 + .quad 0X3fe5b00000000000 + .quad 0X3fe5a00000000000 + .quad 0X3fe5900000000000 + .quad 0X3fe5800000000000 + .quad 0X3fe5800000000000 + .quad 0X3fe5700000000000 + .quad 0X3fe5600000000000 + .quad 0X3fe5500000000000 + .quad 0X3fe5400000000000 + .quad 0X3fe5300000000000 + .quad 0X3fe5200000000000 + .quad 0X3fe5100000000000 + .quad 0X3fe5000000000000 + .quad 0X3fe5000000000000 + .quad 0X3fe4f00000000000 + .quad 0X3fe4e00000000000 + .quad 0X3fe4d00000000000 + .quad 0X3fe4c00000000000 + .quad 0X3fe4b00000000000 + .quad 0X3fe4a00000000000 + .quad 0X3fe4a00000000000 + .quad 0X3fe4900000000000 + .quad 0X3fe4800000000000 + .quad 0X3fe4700000000000 + .quad 0X3fe4600000000000 + .quad 0X3fe4600000000000 + .quad 0X3fe4500000000000 + .quad 0X3fe4400000000000 + .quad 0X3fe4300000000000 + .quad 0X3fe4200000000000 + .quad 0X3fe4200000000000 + .quad 0X3fe4100000000000 + .quad 0X3fe4000000000000 + .quad 0X3fe3f00000000000 + .quad 0X3fe3e00000000000 + .quad 0X3fe3e00000000000 + .quad 0X3fe3d00000000000 + .quad 0X3fe3c00000000000 + .quad 0X3fe3b00000000000 + .quad 0X3fe3b00000000000 + .quad 0X3fe3a00000000000 + .quad 0X3fe3900000000000 + .quad 0X3fe3800000000000 + .quad 0X3fe3800000000000 + .quad 0X3fe3700000000000 + .quad 0X3fe3600000000000 + .quad 0X3fe3500000000000 + .quad 0X3fe3500000000000 + .quad 0X3fe3400000000000 + .quad 0X3fe3300000000000 + .quad 0X3fe3200000000000 + .quad 0X3fe3200000000000 + .quad 0X3fe3100000000000 + .quad 0X3fe3000000000000 + .quad 0X3fe3000000000000 + .quad 0X3fe2f00000000000 + .quad 0X3fe2e00000000000 + .quad 0X3fe2e00000000000 + .quad 0X3fe2d00000000000 + .quad 0X3fe2c00000000000 + .quad 0X3fe2b00000000000 + .quad 0X3fe2b00000000000 + .quad 0X3fe2a00000000000 + .quad 0X3fe2900000000000 + .quad 0X3fe2900000000000 + .quad 0X3fe2800000000000 + .quad 0X3fe2700000000000 + .quad 0X3fe2700000000000 + .quad 0X3fe2600000000000 + .quad 0X3fe2500000000000 + .quad 0X3fe2500000000000 + .quad 0X3fe2400000000000 + .quad 0X3fe2300000000000 + .quad 0X3fe2300000000000 + .quad 0X3fe2200000000000 + .quad 0X3fe2100000000000 + .quad 0X3fe2100000000000 + .quad 0X3fe2000000000000 + .quad 0X3fe2000000000000 + .quad 0X3fe1f00000000000 + .quad 0X3fe1e00000000000 + .quad 0X3fe1e00000000000 + .quad 0X3fe1d00000000000 + .quad 0X3fe1c00000000000 + .quad 0X3fe1c00000000000 + .quad 0X3fe1b00000000000 + .quad 0X3fe1b00000000000 + .quad 0X3fe1a00000000000 + .quad 0X3fe1900000000000 + .quad 0X3fe1900000000000 + .quad 0X3fe1800000000000 + .quad 0X3fe1800000000000 + .quad 0X3fe1700000000000 + .quad 
0X3fe1600000000000 + .quad 0X3fe1600000000000 + .quad 0X3fe1500000000000 + .quad 0X3fe1500000000000 + .quad 0X3fe1400000000000 + .quad 0X3fe1300000000000 + .quad 0X3fe1300000000000 + .quad 0X3fe1200000000000 + .quad 0X3fe1200000000000 + .quad 0X3fe1100000000000 + .quad 0X3fe1100000000000 + .quad 0X3fe1000000000000 + .quad 0X3fe0f00000000000 + .quad 0X3fe0f00000000000 + .quad 0X3fe0e00000000000 + .quad 0X3fe0e00000000000 + .quad 0X3fe0d00000000000 + .quad 0X3fe0d00000000000 + .quad 0X3fe0c00000000000 + .quad 0X3fe0c00000000000 + .quad 0X3fe0b00000000000 + .quad 0X3fe0a00000000000 + .quad 0X3fe0a00000000000 + .quad 0X3fe0900000000000 + .quad 0X3fe0900000000000 + .quad 0X3fe0800000000000 + .quad 0X3fe0800000000000 + .quad 0X3fe0700000000000 + .quad 0X3fe0700000000000 + .quad 0X3fe0600000000000 + .quad 0X3fe0600000000000 + .quad 0X3fe0500000000000 + .quad 0X3fe0500000000000 + .quad 0X3fe0400000000000 + .quad 0X3fe0400000000000 + .quad 0X3fe0300000000000 + .quad 0X3fe0300000000000 + .quad 0X3fe0200000000000 + .quad 0X3fe0200000000000 + .quad 0X3fe0100000000000 + .quad 0X3fe0100000000000 + .quad 0X3fe0000000000000 + +.align 32 +.L__CubeRootTable_256: .quad 0X3ff0000000000000 + .quad 0X3ff00558e6547c36 + .quad 0X3ff00ab8f9d2f374 + .quad 0X3ff010204b673fc7 + .quad 0X3ff0158eec36749b + .quad 0X3ff01b04ed9fdb53 + .quad 0X3ff02082613df53c + .quad 0X3ff0260758e78308 + .quad 0X3ff02b93e6b091f0 + .quad 0X3ff031281ceb8ea2 + .quad 0X3ff036c40e2a5e2a + .quad 0X3ff03c67cd3f7cea + .quad 0X3ff03f3c9fee224c + .quad 0X3ff044ec379f7f79 + .quad 0X3ff04aa3cd578d67 + .quad 0X3ff0506374d40a3d + .quad 0X3ff0562b4218a6e3 + .quad 0X3ff059123d3a9848 + .quad 0X3ff05ee6694e7166 + .quad 0X3ff064c2ee6e07c6 + .quad 0X3ff06aa7e19c01c5 + .quad 0X3ff06d9d8b1decca + .quad 0X3ff0738f4b6cc8e2 + .quad 0X3ff07989af9f9f59 + .quad 0X3ff07c8a2611201c + .quad 0X3ff08291a9958f03 + .quad 0X3ff088a208c3fe28 + .quad 0X3ff08bad91dd7d8b + .quad 0X3ff091cb6588465e + .quad 0X3ff097f24eab04a1 + .quad 0X3ff09b0932aee3f2 + .quad 0X3ff0a13de8970de4 + .quad 0X3ff0a45bc08a5ac7 + .quad 0X3ff0aa9e79bfa986 + .quad 0X3ff0b0eaa961ca5b + .quad 0X3ff0b4145573271c + .quad 0X3ff0ba6ee5f9aad4 + .quad 0X3ff0bd9fd0dbe02d + .quad 0X3ff0c408fc1cfd4b + .quad 0X3ff0c741430e2059 + .quad 0X3ff0cdb9442ea813 + .quad 0X3ff0d0f905168e6c + .quad 0X3ff0d7801893d261 + .quad 0X3ff0dac772091bde + .quad 0X3ff0e15dd5c330ab + .quad 0X3ff0e4ace71080a4 + .quad 0X3ff0e7fe920f3037 + .quad 0X3ff0eea9c37e497e + .quad 0X3ff0f203512f4314 + .quad 0X3ff0f8be68db7f32 + .quad 0X3ff0fc1ffa42d902 + .quad 0X3ff102eb3af9ed89 + .quad 0X3ff10654f1e29cfb + .quad 0X3ff109c1679c189f + .quad 0X3ff110a29f080b3d + .quad 0X3ff114176891738a + .quad 0X3ff1178f0099b429 + .quad 0X3ff11e86ac2cd7ab + .quad 0X3ff12206c7cf4046 + .quad 0X3ff12589c21fb842 + .quad 0X3ff12c986355d0d2 + .quad 0X3ff13024129645cf + .quad 0X3ff133b2b13aa0eb + .quad 0X3ff13ad8cdc48ba3 + .quad 0X3ff13e70544b1d4f + .quad 0X3ff1420adb77c99a + .quad 0X3ff145a867b1bfea + .quad 0X3ff14ceca1189d6d + .quad 0X3ff15093574284e9 + .quad 0X3ff1543d2473ea9b + .quad 0X3ff157ea0d433a46 + .quad 0X3ff15f4d44462724 + .quad 0X3ff163039bd7cde6 + .quad 0X3ff166bd21c3a8e2 + .quad 0X3ff16a79dad1fb59 + .quad 0X3ff171fcf9aaac3d + .quad 0X3ff175c3693980c3 + .quad 0X3ff1798d1f73f3ef + .quad 0X3ff17d5a2156e97f + .quad 0X3ff1812a73ea2593 + .quad 0X3ff184fe1c406b8f + .quad 0X3ff18caf82b8dba4 + .quad 0X3ff1908d4b38a510 + .quad 0X3ff1946e7e36f7e5 + .quad 0X3ff1985320ff72a2 + .quad 0X3ff19c3b38e975a8 + .quad 0X3ff1a026cb58453d + .quad 0X3ff1a415ddbb2c10 + .quad 
0X3ff1a808758d9e32 + .quad 0X3ff1aff84bac98ea + .quad 0X3ff1b3f5952e1a50 + .quad 0X3ff1b7f67a896220 + .quad 0X3ff1bbfb0178d186 + .quad 0X3ff1c0032fc3cf91 + .quad 0X3ff1c40f0b3eefc4 + .quad 0X3ff1c81e99cc193f + .quad 0X3ff1cc31e15aae72 + .quad 0X3ff1d048e7e7b565 + .quad 0X3ff1d463b37e0090 + .quad 0X3ff1d8824a365852 + .quad 0X3ff1dca4b237a4f7 + .quad 0X3ff1e0caf1b71965 + .quad 0X3ff1e4f50ef85e61 + .quad 0X3ff1e923104dbe76 + .quad 0X3ff1ed54fc185286 + .quad 0X3ff1f18ad8c82efc + .quad 0X3ff1f5c4acdc91aa + .quad 0X3ff1fa027ee4105b + .quad 0X3ff1fe44557cc808 + .quad 0X3ff2028a37548ccf + .quad 0X3ff206d42b291a95 + .quad 0X3ff20b2237c8466a + .quad 0X3ff20f74641030a6 + .quad 0X3ff213cab6ef77c7 + .quad 0X3ff2182537656c13 + .quad 0X3ff21c83ec824406 + .quad 0X3ff220e6dd675180 + .quad 0X3ff2254e114737d2 + .quad 0X3ff229b98f66228c + .quad 0X3ff22e295f19fd31 + .quad 0X3ff2329d87caabb6 + .quad 0X3ff2371610f243f2 + .quad 0X3ff23b93021d47da + .quad 0X3ff2401462eae0b8 + .quad 0X3ff2449a3b0d1b3f + .quad 0X3ff2449a3b0d1b3f + .quad 0X3ff2492492492492 + .quad 0X3ff24db370778844 + .quad 0X3ff25246dd846f45 + .quad 0X3ff256dee16fdfd4 + .quad 0X3ff25b7b844dfe71 + .quad 0X3ff2601cce474fd2 + .quad 0X3ff264c2c798fbe5 + .quad 0X3ff2696d789511e2 + .quad 0X3ff2696d789511e2 + .quad 0X3ff26e1ce9a2cd73 + .quad 0X3ff272d1233edcf3 + .quad 0X3ff2778a2dfba8d0 + .quad 0X3ff27c4812819c13 + .quad 0X3ff2810ad98f6e10 + .quad 0X3ff285d28bfa6d45 + .quad 0X3ff285d28bfa6d45 + .quad 0X3ff28a9f32aecb79 + .quad 0X3ff28f70d6afeb08 + .quad 0X3ff294478118ad83 + .quad 0X3ff299233b1bc38a + .quad 0X3ff299233b1bc38a + .quad 0X3ff29e040e03fdfb + .quad 0X3ff2a2ea0334a07b + .quad 0X3ff2a7d52429b556 + .quad 0X3ff2acc57a7862c2 + .quad 0X3ff2acc57a7862c2 + .quad 0X3ff2b1bb0fcf4190 + .quad 0X3ff2b6b5edf6b54a + .quad 0X3ff2bbb61ed145cf + .quad 0X3ff2c0bbac5bfa6e + .quad 0X3ff2c0bbac5bfa6e + .quad 0X3ff2c5c6a0aeb681 + .quad 0X3ff2cad705fc97a6 + .quad 0X3ff2cfece6945583 + .quad 0X3ff2cfece6945583 + .quad 0X3ff2d5084ce0a331 + .quad 0X3ff2da294368924f + .quad 0X3ff2df4fd4cff7c3 + .quad 0X3ff2df4fd4cff7c3 + .quad 0X3ff2e47c0bd7d237 + .quad 0X3ff2e9adf35eb25a + .quad 0X3ff2eee5966124e8 + .quad 0X3ff2eee5966124e8 + .quad 0X3ff2f422fffa1e92 + .quad 0X3ff2f9663b6369b6 + .quad 0X3ff2feaf53f61612 + .quad 0X3ff2feaf53f61612 + .quad 0X3ff303fe552aea57 + .quad 0X3ff309534a9ad7ce + .quad 0X3ff309534a9ad7ce + .quad 0X3ff30eae3fff6ff3 + .quad 0X3ff3140f41335c2f + .quad 0X3ff3140f41335c2f + .quad 0X3ff319765a32d7ae + .quad 0X3ff31ee3971c2b5b + .quad 0X3ff3245704302c13 + .quad 0X3ff3245704302c13 + .quad 0X3ff329d0add2bb20 + .quad 0X3ff32f50a08b48f9 + .quad 0X3ff32f50a08b48f9 + .quad 0X3ff334d6e9055a5f + .quad 0X3ff33a6394110fe6 + .quad 0X3ff33a6394110fe6 + .quad 0X3ff33ff6aea3afed + .quad 0X3ff3459045d8331b + .quad 0X3ff3459045d8331b + .quad 0X3ff34b3066efd36b + .quad 0X3ff350d71f529dd8 + .quad 0X3ff350d71f529dd8 + .quad 0X3ff356847c9006b4 + .quad 0X3ff35c388c5f80bf + .quad 0X3ff35c388c5f80bf + .quad 0X3ff361f35ca116ff + .quad 0X3ff361f35ca116ff + .quad 0X3ff367b4fb5e0985 + .quad 0X3ff36d7d76c96d0a + .quad 0X3ff36d7d76c96d0a + .quad 0X3ff3734cdd40cd95 + .quad 0X3ff379233d4cd42a + .quad 0X3ff379233d4cd42a + .quad 0X3ff37f00a5a1ef96 + .quad 0X3ff37f00a5a1ef96 + .quad 0X3ff384e52521006c + .quad 0X3ff38ad0cad80848 + .quad 0X3ff38ad0cad80848 + .quad 0X3ff390c3a602dc60 + .quad 0X3ff390c3a602dc60 + .quad 0X3ff396bdc60bdb88 + .quad 0X3ff39cbf3a8ca7a9 + .quad 0X3ff39cbf3a8ca7a9 + .quad 0X3ff3a2c8134ee2d1 + .quad 0X3ff3a2c8134ee2d1 + .quad 0X3ff3a8d8604cefe3 + .quad 
0X3ff3aef031b2b706 + .quad 0X3ff3aef031b2b706 + .quad 0X3ff3b50f97de6de5 + .quad 0X3ff3b50f97de6de5 + .quad 0X3ff3bb36a36163d8 + .quad 0X3ff3bb36a36163d8 + .quad 0X3ff3c1656500d20a + .quad 0X3ff3c79bedb6afb8 + .quad 0X3ff3c79bedb6afb8 + .quad 0X3ff3cdda4eb28aa2 + .quad 0X3ff3cdda4eb28aa2 + .quad 0X3ff3d420995a63c0 + .quad 0X3ff3d420995a63c0 + .quad 0X3ff3da6edf4b9061 + .quad 0X3ff3da6edf4b9061 + .quad 0X3ff3e0c5325b9fc2 + .quad 0X3ff3e723a499453f + .quad 0X3ff3e723a499453f + .quad 0X3ff3ed8a484d473a + .quad 0X3ff3ed8a484d473a + .quad 0X3ff3f3f92ffb72d8 + .quad 0X3ff3f3f92ffb72d8 + .quad 0X3ff3fa706e6394a4 + .quad 0X3ff3fa706e6394a4 + .quad 0X3ff400f01682764a + .quad 0X3ff400f01682764a + .quad 0X3ff407783b92e17a + .quad 0X3ff407783b92e17a + .quad 0X3ff40e08f10ea81a + .quad 0X3ff40e08f10ea81a + .quad 0X3ff414a24aafb1e6 + .quad 0X3ff414a24aafb1e6 + .quad 0X3ff41b445c710fa7 + .quad 0X3ff41b445c710fa7 + .quad 0X3ff421ef3a901411 + .quad 0X3ff421ef3a901411 + .quad 0X3ff428a2f98d728b + + + + + +
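Read as C, the normal path of cbrtf.S does the following. This is a minimal sketch under stated assumptions (finite, normal x; round-to-nearest), not the shipped routine: the NaN/infinity/zero/denormal paths are omitted, and recip_tab, cbrt_tab and cbrt_2pow are hypothetical names standing in for .L__DoubleReciprocalTable_256, .L__CubeRootTable_256 and .L__defined_cuberoot.

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical tables mirroring the data above:
       recip_tab[i] ~ 1/m for each of 256 mantissa bins in [1,2)
       cbrt_tab[i]  = cbrt(1/recip_tab[i])
       cbrt_2pow[k] = cbrt(2^(k-2)) for k = 0..4 */
    extern const double recip_tab[256], cbrt_tab[256], cbrt_2pow[5];

    float cbrtf_sketch(float x)                  /* finite, normal x only */
    {
        uint32_t ux;
        memcpy(&ux, &x, sizeof ux);

        int e = (int)((ux >> 23) & 0xff) - 127;  /* unbiased exponent */
        int i = (ux & 0x007fffffu) >> 15;        /* top 8 mantissa bits */
        int q = e / 3, d = e % 3;                /* e = 3*q + d, d in -2..2 */

        uint32_t mbits = (ux & 0x007fffffu) | 0x3f800000u;
        float mf;
        memcpy(&mf, &mbits, sizeof mf);
        double m = mf;                           /* mantissa scaled to [1,2) */

        double r = m * recip_tab[i] - 1.0;       /* small, roughly |r| < 2^-8 */
        double p = 1.0 + r * (1.0 / 3) - r * r * (1.0 / 9);  /* cbrt(1+r) */

        double res = ldexp(p * cbrt_tab[i] * cbrt_2pow[d + 2], q);
        return (float)copysign(res, (double)x);
    }

The series coefficients match the two constants in .L__coefficients: 0x3fd5555555555555 is 1/3 and 0xbFBC71C71C71C71C is -1/9.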
diff --git a/src/gas/copysign.S b/src/gas/copysign.S new file mode 100644 index 0000000..d5b96cf --- /dev/null +++ b/src/gas/copysign.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# copysign.S
+#
+# An implementation of the copysign libm function.
+#
+# The copysign functions produce a value with the magnitude of x and the sign of y.
+# They produce a NaN (with the sign of y) if x is a NaN. On implementations that
+# represent a signed zero but do not treat negative zero consistently in arithmetic
+# operations, the copysign functions regard the sign of zero as positive.
+#
+#
+# Prototype:
+#
+#     double copysign(double x, double y)
+#
+#
+#
+# Algorithm:
+#
+# Shift the sign bit out of x (left, then right, by one), isolate the
+# sign bit of y (right, then left, by 63), and OR the two results
+# together (see the C sketch after this file's diff).
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(copysign)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ PSLLQ $1,%xmm0
+ PSRLQ $1,%xmm0
+ PSRLQ $63,%xmm1
+ PSLLQ $63,%xmm1
+ POR %xmm1,%xmm0
+
+ ret
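In C, the same branch-free masking looks like the sketch below (an illustration assuming IEEE-754 binary64, not the shipped code):

    #include <stdint.h>
    #include <string.h>

    /* PSLLQ/PSRLQ by 1 clear the sign bit of x, PSRLQ/PSLLQ by 63
       isolate the sign bit of y, and POR merges the two. */
    double copysign_sketch(double x, double y)
    {
        uint64_t ux, uy;
        memcpy(&ux, &x, sizeof ux);
        memcpy(&uy, &y, sizeof uy);
        ux = (ux << 1) >> 1;        /* magnitude of x, sign bit cleared */
        uy = (uy >> 63) << 63;      /* sign bit of y, everything else zero */
        ux |= uy;
        memcpy(&x, &ux, sizeof x);
        return x;
    }

copysignf.S, next, is the identical trick with 32-bit shifts (1 and 31 in place of 1 and 63).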
diff --git a/src/gas/copysignf.S b/src/gas/copysignf.S new file mode 100644 index 0000000..90e63d6 --- /dev/null +++ b/src/gas/copysignf.S
@@ -0,0 +1,70 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# copysignf.S
+#
+# An implementation of the copysignf libm function.
+#
+# The copysign functions produce a value with the magnitude of x and the sign of y.
+# They produce a NaN (with the sign of y) if x is a NaN. On implementations that
+# represent a signed zero but do not treat negative zero consistently in arithmetic
+# operations, the copysign functions regard the sign of zero as positive.
+#
+# Prototype:
+#
+#     float copysignf(float x, float y)
+#
+
+#
+# Algorithm:
+#
+# Same as copysign, using 32-bit shifts: clear the sign bit of x,
+# isolate the sign bit of y, and OR the two together.
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(copysignf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ #PANDN .L__fabsf_and_mask, %xmm1
+ #POR %xmm1,%xmm0
+
+ PSLLD $1,%xmm0
+ PSRLD $1,%xmm0
+ PSRLD $31,%xmm1
+ PSLLD $31,%xmm1
+ POR %xmm1,%xmm0
+
+ ret
+
+#.align 16
+#.L__sign_mask: .long 0x7FFFFFFF
+ .long 0x0
+ .quad 0x0
+
diff --git a/src/gas/cos.S b/src/gas/cos.S new file mode 100644 index 0000000..dc227e0 --- /dev/null +++ b/src/gas/cos.S
@@ -0,0 +1,485 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# +# An implementation of the cos function. +# +# Prototype: +# +# double cos(double x); +# +# Computes cos(x). +# It will provide proper C99 return values, +# but may not raise floating point status bits properly. +# Based on the NAG C implementation. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 32 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0 # for alignment +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0 +.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 0x0411E848000000000 # 5e5 + .quad 0 +.L__real_bfe0000000000000: .quad 0x0bfe0000000000000 # - 0.5 + .quad 0 + +.align 32 +.Lcosarray: + .quad 0x3fa5555555555555 # 0.0416667 c1 + .quad 0 + .quad 0xbf56c16c16c16967 # -0.00138889 c2 + .quad 0 + .quad 0x3EFA01A019F4EC91 # 2.48016e-005 c3 + .quad 0 + .quad 0xbE927E4FA17F667B # -2.75573e-007 c4 + .quad 0 + .quad 0x3E21EEB690382EEC # 2.08761e-009 c5 + .quad 0 + .quad 0xbDA907DB47258AA7 # -1.13826e-011 c6 + .quad 0 + +.align 32 +.Lsinarray: + .quad 0xbfc5555555555555 # -0.166667 s1 + .quad 0 + .quad 0x3f81111111110bb3 # 0.00833333 s2 + .quad 0 + .quad 0xbf2a01a019e83e5c # -0.000198413 s3 + .quad 0 + .quad 0x3ec71de3796cde01 # 2.75573e-006 s4 + .quad 0 + .quad 0xbe5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0 + .quad 0x3de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0 + +.text +.align 32 +.p2align 5,,31 + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(cos) +#define fname_special _cos_special@PLT + +# define local variable storage offsets +.equ p_temp, 0x30 # temporary for get/put bits operation +.equ p_temp1, 0x40 # temporary for get/put bits operation +.equ r, 0x50 # pointer to r for amd_remainder_piby2 +.equ rr, 0x60 # pointer to rr for amd_remainder_piby2 +.equ region, 0x70 # pointer to region for amd_remainder_piby2 +.equ stack_size, 0x98 + +.globl fname +.type fname,@function + +fname: + sub $stack_size, %rsp + xorpd %xmm2, %xmm2 # zeroed out for later use + +# GET_BITS_DP64(x, ux); +# get the input value to an integer register. 
+ movsd %xmm0,p_temp(%rsp) + mov p_temp(%rsp), %rdx # rdx is ux + +## if NaN or inf + mov $0x07ff0000000000000, %rax + mov %rax, %r10 + and %rdx, %r10 + cmp %rax, %r10 + jz .Lcos_naninf + +# ax = (ux & ~SIGNBIT_DP64); + mov $0x07fffffffffffffff, %r10 + and %rdx, %r10 # r10 is ax + mov $1, %r8d # for determining region later on + + +## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */ + mov $0x03fe921fb54442d18, %rax + cmp %rax, %r10 + jg .Lcos_reduce + +## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */ + mov $0x03f20000000000000, %rax + cmp %rax, %r10 + jge .Lcos_small + +## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */ + mov $0x03e40000000000000, %rax + cmp %rax, %r10 + jge .Lcos_smaller + +# cos = 1.0; + movsd .L__real_3ff0000000000000(%rip), %xmm0 # return a 1 + jmp .Lcos_cleanup + +## else +.align 16 +.Lcos_smaller: +# cos = 1.0 - x*x*0.5; + movsd %xmm0, %xmm2 + mulsd %xmm2, %xmm2 # x^2 + movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0 + mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 * x^2 + subsd %xmm2, %xmm0 + jmp .Lcos_cleanup + +## else + +.align 16 +.Lcos_small: +# cos = cos_piby4(x, 0.0); + movsd %xmm0, %xmm2 + mulsd %xmm0, %xmm2 # x2 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# region 0 or 2 - do a cos calculation +# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6)))); + + movsd .Lcosarray+0x10(%rip), %xmm1 # c2 + movsd %xmm2, %xmm4 # move for x4 + mulsd %xmm2, %xmm4 # x4 + movsd .Lcosarray+0x30(%rip), %xmm3 # c4 + mulsd %xmm2, %xmm1 # c2x2 + movsd .Lcosarray+0x50(%rip), %xmm5 # c6 + mulsd %xmm2, %xmm3 # c4x2 + movsd %xmm4, %xmm0 # move for x8 + mulsd %xmm2, %xmm5 # c6x2 + mulsd %xmm4, %xmm0 # x8 + addsd .Lcosarray(%rip), %xmm1 # c1 + c2x2 + mulsd %xmm4, %xmm1 # c1x4 + c2x6 + addsd .Lcosarray+0x20(%rip), %xmm3 # c3 + c4x2 + mulsd .L__real_bfe0000000000000(%rip), %xmm2 # -0.5x2, destroy xmm2 + addsd .Lcosarray+0x40(%rip), %xmm5 # c5 + c6x2 + mulsd %xmm0, %xmm3 # c3x8 + c4x10 + mulsd %xmm0, %xmm4 # x12 + mulsd %xmm5, %xmm4 # c5x12 + c6x14 + + movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1 + addsd %xmm3, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10 + movsd %xmm2, %xmm3 # preserve -0.5x2 + addsd %xmm0, %xmm2 # t = 1 - 0.5x2 + subsd %xmm2, %xmm0 # 1-t + addsd %xmm3, %xmm0 # (1-t) - r + addsd %xmm4, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10 + c5x12 + c6x14 + addsd %xmm1, %xmm0 # (1-t) - r + c1x4 + c2x6 + c3x8 + c4x10 + c5x12 + c6x14 + addsd %xmm2, %xmm0 # 1 - 0.5x2 + above + + jmp .Lcos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcos_reduce: + +# xneg = (ax != ux); + cmp %r10, %rdx + +## if (xneg) x = -x; + jz .Lpositive + subsd %xmm0, %xmm2 + movsd %xmm2, %xmm0 + +.Lpositive: +## if (x < 5.0e5) + cmp .L__real_411E848000000000(%rip), %r10 + jae .Lcos_reduce_precise + +# reduce the argument to be in a range from -pi/4 to +pi/4 +# by subtracting multiples of pi/2 + movsd %xmm0, %xmm2 + movsd .L__real_3fe45f306dc9c883(%rip), %xmm3 # twobypi + movsd %xmm0, %xmm4 + movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5 + mulsd %xmm3, %xmm2 + +#/* How many pi/2 is x a multiple of? */ +# xexp = ax >> EXPSHIFTBITS_DP64; + mov %r10, %r9 + shr $52, %r9 # >>EXPSHIFTBITS_DP64 + +# npi2 = (int)(x * twobypi + 0.5); + addsd %xmm5, %xmm2 # npi2 + + movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1 + cvttpd2dq %xmm2, %xmm0 # convert to integer + movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail + cvtdq2pd %xmm0, %xmm2 # and back to float. 
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ +# rhead = x - npi2 * piby2_1; + mulsd %xmm2, %xmm3 + subsd %xmm3, %xmm4 # rhead + +# rtail = npi2 * piby2_1tail; + mulsd %xmm2, %xmm1 + movd %xmm0, %eax + +# GET_BITS_DP64(rhead-rtail, uy); + movsd %xmm4, %xmm0 + subsd %xmm1, %xmm0 + + movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2 + movsd %xmm0,p_temp(%rsp) + movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail + mov p_temp(%rsp), %rcx # rcx is rhead-rtail + +# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc +# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64); + shl $1, %rcx # strip any sign bit + shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1 + sub %rcx, %r9 # expdiff + +## if (expdiff > 15) + cmp $15, %r9 + jle .Lexpdiffless15 + +# /* The remainder is pretty small compared with x, which +# implies that x is a near multiple of pi/2 +# (x matches the multiple to at least 15 bits) */ + +# t = rhead; + movsd %xmm4, %xmm1 + +# rtail = npi2 * piby2_2; + mulsd %xmm2, %xmm3 + +# rhead = t - rtail; + mulsd %xmm2, %xmm5 # npi2 * piby2_2tail + subsd %xmm3, %xmm4 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + subsd %xmm4, %xmm1 # t - rhead + subsd %xmm3, %xmm1 # -rtail + subsd %xmm1, %xmm5 # rtail + +# r = rhead - rtail; + movsd %xmm4, %xmm0 + +#HARSHA +#xmm1=rtail + movsd %xmm5, %xmm1 + subsd %xmm5, %xmm0 + +# xmm0=r, xmm4=rhead, xmm1=rtail +.Lexpdiffless15: +# region = npi2 & 3; + + subsd %xmm0, %xmm4 # rhead-r + subsd %xmm1, %xmm4 # rr = (rhead-r) - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +## if the input was close to a pi/2 multiple +# The original NAG code missed this trick. If the input is very close to n*pi/2 after +# reduction, +# then the cos is ~ 1.0 , to within 53 bits, when r is < 2^-27. We already +# have x at this point, so we can skip the cos polynomials. + + cmp $0x03f2, %rcx # if r small. + jge .Lcos_piby4 # use taylor series if not + cmp $0x03de, %rcx # if r really small. + jle .Lr_small # then cos(r) = 1 + + movsd %xmm0, %xmm2 + mulsd %xmm2, %xmm2 # x^2 + +## if region is 1 or 3 do a sin calc. + and %eax, %r8d + jz .Lsinsmall + +# region 1 or 3 +# use simply polynomial +# *s = x - x*x*x*0.166666666666666666; + movsd .L__real_3fc5555555555555(%rip), %xmm3 + mulsd %xmm0, %xmm3 # * x + mulsd %xmm2, %xmm3 # * x^2 + subsd %xmm3, %xmm0 # xs + jmp .Ladjust_region + +.align 16 +.Lsinsmall: +# region 0 or 2 +# cos = 1.0 - x*x*0.5; + movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0 + mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2 + subsd %xmm2, %xmm0 + jmp .Ladjust_region + +.align 16 +.Lr_small: +## if region is 1 or 3 do a sin calc. 
+ and %eax, %r8d + jnz .Ladjust_region + + movsd .L__real_3ff0000000000000(%rip), %xmm0 # cos(r) is a 1 + jmp .Ladjust_region + +.align 32 +.Lcos_reduce_precise: +# // Reduce x into range [-pi/4,pi/4] +# __amd_remainder_piby2(x, &r, &rr, ®ion); + + lea region(%rsp), %rdx + lea rr(%rsp), %rsi + lea r(%rsp), %rdi + + call __amd_remainder_piby2@PLT + + mov $1, %r8d # for determining region later on + movsd r(%rsp), %xmm0 # x + movsd rr(%rsp), %xmm4 # xx + mov region(%rsp), %eax # region + +# xmm0 = x, xmm4 = xx, r8d = 1, eax= region +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 32 +# perform taylor series to calc sinx, cosx +.Lcos_piby4: +# x2 = r * r; + +#xmm4 = a part of rr for the sin path, xmm4 is overwritten in the cos path +#instead use xmm3 because that was freed up in the sin path, xmm3 is overwritten in sin path + movsd %xmm0, %xmm3 + movsd %xmm0, %xmm2 + mulsd %xmm0, %xmm2 # x2 + +## if region is 1 or 3 do a sin calc. + and %eax, %r8d + jz .Lcospiby4 + +# region 1 or 3 + movsd .Lsinarray+0x50(%rip), %xmm3 # s6 + mulsd %xmm2, %xmm3 # x2s6 + movsd .Lsinarray+0x20(%rip), %xmm5 # s3 + movsd %xmm4,p_temp(%rsp) # store xx + movsd %xmm2, %xmm1 # move for x4 + mulsd %xmm2, %xmm1 # x4 + movsd %xmm0,p_temp1(%rsp) # store x + mulsd %xmm2, %xmm5 # x2s3 + movsd %xmm0, %xmm4 # move for x3 + addsd .Lsinarray+0x40(%rip), %xmm3 # s5+x2s6 + mulsd %xmm2, %xmm1 # x6 + mulsd %xmm2, %xmm3 # x2(s5+x2s6) + mulsd %xmm2, %xmm4 # x3 + addsd .Lsinarray+0x10(%rip), %xmm5 # s2+x2s3 + mulsd %xmm2, %xmm5 # x2(s2+x2s3) + addsd .Lsinarray+0x30(%rip), %xmm3 # s4 + x2(s5+x2s6) + mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x2 + movsd p_temp(%rsp), %xmm0 # load xx + mulsd %xmm1, %xmm3 # x6(s4 + x2(s5+x2s6)) + addsd .Lsinarray(%rip), %xmm5 # s1+x2(s2+x2s3) + mulsd %xmm0, %xmm2 # 0.5 * x2 *xx + addsd %xmm5, %xmm3 # zs + mulsd %xmm3, %xmm4 # *x3 + subsd %xmm2, %xmm4 # x3*zs - 0.5 * x2 *xx + addsd %xmm4, %xmm0 # +xx + addsd p_temp1(%rsp), %xmm0 # +x + + jmp .Ladjust_region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcospiby4: + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# region 0 or 2 - do a cos calculation +# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6)))); + mulsd %xmm0, %xmm4 # x*xx + movsd .L__real_3fe0000000000000(%rip), %xmm5 + movsd .Lcosarray+0x50(%rip), %xmm1 # c6 + movsd .Lcosarray+0x20(%rip), %xmm0 # c3 + mulsd %xmm2, %xmm5 # r = 0.5 *x2 + movsd %xmm2, %xmm3 # copy of x2 + movsd %xmm4,p_temp(%rsp) # store x*xx + mulsd %xmm2, %xmm1 # c6*x2 + mulsd %xmm2, %xmm0 # c3*x2 + subsd .L__real_3ff0000000000000(%rip), %xmm5 # -t=r-1.0 ;trash r + mulsd %xmm2, %xmm3 # x4 + addsd .Lcosarray+0x40(%rip), %xmm1 # c5+x2c6 + addsd .Lcosarray+0x10(%rip), %xmm0 # c2+x2C3 + addsd .L__real_3ff0000000000000(%rip), %xmm5 # 1 + (-t) ;trash t + mulsd %xmm2, %xmm3 # x6 + mulsd %xmm2, %xmm1 # x2(c5+x2c6) + mulsd %xmm2, %xmm0 # x2(c2+x2C3) + movsd %xmm2, %xmm4 # copy of x2 + mulsd .L__real_3fe0000000000000(%rip), %xmm4 # r recalculate + addsd .Lcosarray+0x30(%rip), %xmm1 # c4 + x2(c5+x2c6) + addsd .Lcosarray(%rip), %xmm0 # c1+x2(c2+x2C3) + mulsd %xmm2, %xmm2 # x4 recalculate + subsd %xmm4, %xmm5 # (1 + (-t)) - r + mulsd %xmm3, %xmm1 # x6(c4 + x2(c5+x2c6)) + addsd %xmm1, %xmm0 # zc + subsd .L__real_3ff0000000000000(%rip), %xmm4 # t relaculate + subsd p_temp(%rsp), %xmm5 # ((1 + (-t)) - r) - x*xx + mulsd %xmm2, %xmm0 # x4 * zc + addsd %xmm5, %xmm0 # x4 * zc + ((1 + (-t)) - 
r -x*xx)
+ subsd %xmm4, %xmm0 # result - (-t)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 32
+.Ladjust_region: # positive or negative: (0, 1, 2, 3) => (1, 2, 3, 4) => (0, 2, 2, 0)
+# switch (region)
+ add $1, %eax
+ and $2, %eax
+ jz .Lcos_cleanup
+## if the original region was 1 or 2 then we negate the result.
+ movsd %xmm0, %xmm2
+ xorpd %xmm0, %xmm0
+ subsd %xmm2, %xmm0
+
+.align 32
+.Lcos_cleanup:
+ add $stack_size, %rsp
+ ret
+
+.align 32
+.Lcos_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
+
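The reduction step in cos.S is worth restating in C. For |x| below the .L__real_411E848000000000 cutoff it subtracts npi2 multiples of pi/2 held as a head (piby2_1) plus a tail (piby2_1tail), with a further piby2_2/piby2_2tail refinement when x lies very close to a multiple, and it calls __amd_remainder_piby2 for larger arguments. A simplified sketch of the common path, assuming x >= 0 (the code folds the sign first, cos being even) and omitting the close-to-multiple refinement; the decimal constants are the values of the hex words in .data:

    #include <math.h>

    static const double twobypi     = 6.36619772367581382433e-01; /* 2/pi */
    static const double piby2_1     = 1.57079632673412561417e+00; /* pi/2 head */
    static const double piby2_1tail = 6.07710050650619224932e-11; /* pi/2 tail */

    /* Reduce x to r in roughly [-pi/4, pi/4] and report the quadrant. */
    static double reduce_piby2_sketch(double x, int *region)
    {
        int npi2 = (int)(x * twobypi + 0.5);  /* nearest multiple of pi/2 */
        double rhead = x - npi2 * piby2_1;    /* piby2_1 has trailing zero
                                                 bits, so little is lost */
        double rtail = npi2 * piby2_1tail;    /* correction for dropped bits */
        *region = npi2 & 3;                   /* selects sin/cos and the sign */
        return rhead - rtail;
    }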
diff --git a/src/gas/cosf.S b/src/gas/cosf.S new file mode 100644 index 0000000..43eae9a --- /dev/null +++ b/src/gas/cosf.S
@@ -0,0 +1,372 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# An implementation of the cosf function.
+#
+# Prototype:
+#
+#     float cosf(float x);
+#
+# Computes cosf(x).
+# Based on the NAG C implementation.
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3FF921FB54442D18: .quad 0x03FF921FB54442D18 # piby2
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 0x0411E848000000000 # 5e5
+ .quad 0
+
+.align 32
+.Lcsarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+
+.text
+.align 32
+.p2align 5,,31
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cosf)
+#define fname_special _cosf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ region, 0x50 # pointer to region for amd_remainder_piby2
+.equ r, 0x60 # pointer to r for amd_remainder_piby2
+.equ stack_size, 0x88
+
+.globl fname
+.type fname,@function
+
+fname:
+
+ sub $stack_size, %rsp
+
+## if NaN or inf
+ movd %xmm0, %edx
+ mov $0x07f800000, %eax
+ mov %eax, %r10d
+ and %edx, %r10d
+ cmp %eax, %r10d
+ jz .Lcosf_naninf
+
+ xorpd %xmm2, %xmm2
+ mov %rdx, %r11 # save the input bits
+
+# GET_BITS_DP64(x, ux);
+# convert input to double.
+ cvtss2sd %xmm0, %xmm0
+
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp) + mov p_temp(%rsp), %rdx # rdx is ux + +# ax = (ux & ~SIGNBIT_DP64); + mov $0x07fffffffffffffff, %r10 + and %rdx, %r10 # r10 is ax + + mov $1, %r8d # for determining region later on + movsd %xmm0, %xmm1 # copy x to xmm1 + + +## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */ + mov $0x03fe921fb54442d18, %rax + cmp %rax, %r10 + jg .L__sc_reducec + +# *c = cos_piby4(x, 0.0); + movsd %xmm0, %xmm2 + mulsd %xmm2, %xmm2 # x^2 + xor %eax, %eax + mov %r10, %rdx + movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5 + jmp .L__sc_piby4c + +.align 32 +.L__sc_reducec: +# reduce the argument to be in a range from -pi/4 to +pi/4 +# by subtracting multiples of pi/2 +# xneg = (ax != ux); + cmp %r10, %rdx +## if (xneg) x = -x; + jz .Lpositive + subsd %xmm0, %xmm2 + movsd %xmm2, %xmm0 + +.Lpositive: +## if (x < 5.0e5) + cmp .L__real_411E848000000000(%rip), %r10 + jae .Lcosf_reduce_precise + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +# perform taylor series to calc cosx, cosx +# xmm0=abs(x), xmm1=x +.align 32 +.Lcosf_piby4: +#/* How many pi/2 is x a multiple of? */ +# npi2 = (int)(x * twobypi + 0.5); + + movsd %xmm0, %xmm2 + movsd %xmm0, %xmm4 + + mulsd .L__real_3fe45f306dc9c883(%rip), %xmm2 # twobypi + movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5 + +#/* How many pi/2 is x a multiple of? */ + +# xexp = ax >> EXPSHIFTBITS_DP64; + mov %r10, %r9 + shr $52, %r9 # >> EXPSHIFTBITS_DP64 + +# npi2 = (int)(x * twobypi + 0.5); + addsd %xmm5, %xmm2 # npi2 + + movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1 + cvttpd2dq %xmm2, %xmm0 # convert to integer + movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail + cvtdq2pd %xmm0, %xmm2 # and back to double + +# /* Subtract the multiple from x to get an extra-precision remainder */ +# rhead = x - npi2 * piby2_1; + + mulsd %xmm2, %xmm3 # use piby2_1 + subsd %xmm3, %xmm4 # rhead + +# rtail = npi2 * piby2_1tail; + mulsd %xmm2, %xmm1 # rtail + movd %xmm0, %eax + +# GET_BITS_DP64(rhead-rtail, uy); + movsd %xmm4, %xmm0 + subsd %xmm1, %xmm0 + + movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2 + movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail + movd %xmm0, %rcx # rcx is rhead-rtail + +# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64); + shl $1, %rcx # strip any sign bit + shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1 + sub %rcx, %r9 # expdiff + +## if (expdiff > 15) + cmp $15, %r9 + jle .Lexpdiffless15 + +# /* The remainder is pretty small compared with x, which +# implies that x is a near multiple of pi/2 +# (x matches the multiple to at least 15 bits) */ + +# t = rhead; + movsd %xmm4, %xmm1 + +# rtail = npi2 * piby2_2; + mulsd %xmm2, %xmm3 + +# rhead = t - rtail; + mulsd %xmm2, %xmm5 # npi2 * piby2_2tail + subsd %xmm3, %xmm4 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + subsd %xmm4, %xmm1 # t - rhead + subsd %xmm3, %xmm1 # -rtail + subsd %xmm1, %xmm5 # rtail + +# r = rhead - rtail; + movsd %xmm4, %xmm0 + +#HARSHA +#xmm1=rtail + movsd %xmm5, %xmm1 + subsd %xmm5, %xmm0 + +# xmm0=r, xmm4=rhead, xmm1=rtail +.Lexpdiffless15: +# region = npi2 & 3; + + movsd %xmm0, %xmm2 + mulsd %xmm0, %xmm2 #x^2 + movsd %xmm0, %xmm1 + movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5 + + cmp $0x03f2, %rcx # if r small. + jge .L__sc_piby4c # use taylor series if not + cmp $0x03de, %rcx # if r really small. + jle .L__rc_small # then cos(r) = 1 + +## if region is 1 or 3 do a sin calc. 
+ and %eax, %r8d + jz .Lsinsmall +# region 1 or 3 +# use simply polynomial +# *s = x - x*x*x*0.166666666666666666; + movsd .L__real_3fc5555555555555(%rip), %xmm3 + mulsd %xmm1, %xmm3 # * x + mulsd %xmm2, %xmm3 # * x^2 + subsd %xmm3, %xmm1 # xs + jmp .L__adjust_region_cos + +.align 16 +.Lsinsmall: +# region 0 or 2 +# cos = 1.0 - x*x*0.5; + movsd .L__real_3ff0000000000000(%rip), %xmm1 # 1.0 + mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2 + subsd %xmm2, %xmm1 + jmp .L__adjust_region_cos + +.align 16 +.L__rc_small: # then sin(r) = r +## if region is 1 or 3 do a sin calc. + and %eax, %r8d + jnz .L__adjust_region_cos + movsd .L__real_3ff0000000000000(%rip), %xmm1 # cos(r) is a 1 + jmp .L__adjust_region_cos + + +# done with reducing the argument. Now perform the sin/cos calculations. +.align 16 +.L__sc_piby4c: +## if region is 1 or 3 do a sin calc. + and %eax, %r8d + jz .Lcospiby4 + + movsd .Lcsarray+0x30(%rip), %xmm1 # c4 + movsd %xmm2, %xmm4 + mulsd %xmm2, %xmm1 # x2c4 + movsd .Lcsarray+0x10(%rip), %xmm3 # c2 + mulsd %xmm4, %xmm4 # x4 + mulsd %xmm2, %xmm3 # x2c2 + mulsd %xmm0, %xmm2 # x3 + addsd .Lcsarray+0x20(%rip), %xmm1 # c3 + x2c4 + mulsd %xmm4, %xmm1 # x4(c3 + x2c4) + addsd .Lcsarray(%rip), %xmm3 # c1 + x2c2 + addsd %xmm3, %xmm1 # c1 + c2x2 + c3x4 + c4x6 + mulsd %xmm2, %xmm1 # c1x3 + c2x5 + c3x7 + c4x9 + addsd %xmm0, %xmm1 # x + c1x3 + c2x5 + c3x7 + c4x9 + + jmp .L__adjust_region_cos + +.align 16 +.Lcospiby4: +# region 0 or 2 - do a cos calculation + movsd .Lcsarray+0x38(%rip), %xmm1 # c4 + movsd %xmm2, %xmm4 + mulsd %xmm2, %xmm1 # x2c4 + movsd .Lcsarray+0x18(%rip), %xmm3 # c2 + mulsd %xmm4, %xmm4 # x4 + mulsd %xmm2, %xmm3 # x2c2 + mulsd %xmm2, %xmm5 # 0.5 * x2 + addsd .Lcsarray+0x28(%rip), %xmm1 # c3 + x2c4 + mulsd %xmm4, %xmm1 # x4(c3 + x2c4) + addsd .Lcsarray+8(%rip), %xmm3 # c1 + x2c2 + addsd %xmm3, %xmm1 # c1 + x2c2 + c3x4 + c4x6 + mulsd %xmm4, %xmm1 # x4(c1 + c2x2 + c3x4 + c4x6) + +# -t = rc-1; + subsd .L__real_3ff0000000000000(%rip), %xmm5 # 0.5x2 - 1 + subsd %xmm5, %xmm1 # cos = 1 - 0.5x2 + c1x4 + c2x6 + c3x8 + c4x10 + +.L__adjust_region_cos: # xmm1 is cos or sin, relies on previous sections to +# switch (region) + add $1, %eax + and $2, %eax + jz .L__cos_cleanup +## if region 1 or 2 then we negate the result. + xorpd %xmm2, %xmm2 + subsd %xmm1, %xmm2 + movsd %xmm2, %xmm1 + +.align 16 +.L__cos_cleanup: + cvtsd2ss %xmm1, %xmm0 + add $stack_size, %rsp + ret + +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lcosf_reduce_precise: +# /* Reduce abs(x) into range [-pi/4,pi/4] */ +# __amd_remainder_piby2(ax, &r, ®ion); + + mov %rdx,p_temp(%rsp) # save ux for use later + mov %r10,p_temp1(%rsp) # save ax for use later + movd %xmm0, %rdi + lea r(%rsp), %rsi + lea region(%rsp), %rdx + sub $0x020, %rsp + + call __amd_remainder_piby2d2f@PLT + + add $0x020, %rsp + mov p_temp(%rsp), %rdx # restore ux for use later + mov p_temp1(%rsp), %r10 # restore ax for use later + mov $1, %r8d # for determining region later on + movsd r(%rsp), %xmm0 # r + mov region(%rsp), %eax # region + + movsd %xmm0, %xmm2 + mulsd %xmm0, %xmm2 # x^2 + movsd %xmm0, %xmm1 + movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5 + + jmp .L__sc_piby4c + +.align 32 +.Lcosf_naninf: + call fname_special + add $stack_size, %rsp + ret
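Both cos.S and cosf.S finish with the same quadrant fix-up (.Ladjust_region and .L__adjust_region_cos). With x = npi2*(pi/2) + r and region = npi2 & 3, the identities give cos(x) = cos(r), -sin(r), -cos(r), sin(r) for regions 0..3. A sketch of that selection logic, with libm sin/cos standing in for the polynomial kernels:

    #include <math.h>

    /* Odd regions take the sin path ("and %eax, %r8d" with r8d = 1);
       "add $1; and $2" marks regions 1 and 2 for negation. */
    static double cos_from_region(int region, double r)
    {
        double v = (region & 1) ? sin(r) : cos(r);
        if (((region + 1) & 2) != 0)    /* regions 1 and 2 negate */
            v = -v;
        return v;
    }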
diff --git a/src/gas/exp.S b/src/gas/exp.S new file mode 100644 index 0000000..153e8a6 --- /dev/null +++ b/src/gas/exp.S
@@ -0,0 +1,400 @@ +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +#ifdef __x86_64__ +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# +# exp.S +# +# An implementation of the exp libm function. +# +# Prototype: +# +# double exp(double x); +# + +# +# Algorithm: +# +# e^x = 2^(x/ln(2)) = 2^(x*(64/ln(2))/64) +# +# x*(64/ln(2)) = n + f, |f| <= 0.5, n is integer +# n = 64*m + j, 0 <= j < 64 +# +# e^x = 2^((64*m + j + f)/64) +# = (2^m) * (2^(j/64)) * 2^(f/64) +# = (2^m) * (2^(j/64)) * e^(f*(ln(2)/64)) +# +# f = x*(64/ln(2)) - n +# r = f*(ln(2)/64) = x - n*(ln(2)/64) +# +# e^x = (2^m) * (2^(j/64)) * e^r +# +# (2^(j/64)) is precomputed +# +# e^r = 1 + r + (r^2)/2! + (r^3)/3! + (r^4)/4! + (r^5)/5! + (r^5)/5! +# e^r = 1 + q +# +# q = r + (r^2)/2! + (r^3)/3! + (r^4)/4! + (r^5)/5! + (r^5)/5! +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(exp) +#define fname_special _exp_special@PLT + +.text +.p2align 4 +.globl fname +.type fname,@function +fname: + ucomisd .L__max_exp_arg(%rip), %xmm0 + ja .L__y_is_inf + jp .L__y_is_nan + ucomisd .L__denormal_tiny_threshold(%rip), %xmm0 + jbe .L__y_is_zero + + # x * (64/ln(2)) + movapd %xmm0,%xmm1 + mulsd .L__real_64_by_log2(%rip), %xmm1 + + # n = int( x * (64/ln(2)) ) + cvttpd2dq %xmm1, %xmm2 #xmm2 = (int)n + cvtdq2pd %xmm2, %xmm1 #xmm1 = (double)n + movd %xmm2, %ecx + movapd %xmm1,%xmm2 + # r1 = x - n * ln(2)/64 head + mulsd .L__log2_by_64_mhead(%rip),%xmm1 + + #j = n & 0x3f + mov $0x3f, %rax + and %ecx, %eax #eax = j + # m = (n - j) / 64 + sar $6, %ecx #ecx = m + + + # r2 = - n * ln(2)/64 tail + mulsd .L__log2_by_64_mtail(%rip),%xmm2 + addsd %xmm1,%xmm0 #xmm0 = r1 + + # r1+r2 + addsd %xmm0, %xmm2 #xmm2 = r + + # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720 + # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720))))) + movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720 + mulsd %xmm2, %xmm3 #xmm3 = r*1/720 + movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6 + movapd %xmm2, %xmm1 #xmm1 = r + mulsd %xmm2, %xmm0 #xmm0 = r*1/6 + addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720) + mulsd %xmm2, %xmm1 #xmm1 = r*r + addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6) + movapd %xmm1, %xmm4 #xmm4 = r*r + mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r) + mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720)) + mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6)) + addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720))) + addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720)))) + addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720)))) + + # (f)*(q) + f2 + f1 + cmp $0xfffffc02, %ecx # -1022 + lea .L__two_to_jby64_table(%rip), %rdx + lea 
.L__two_to_jby64_tail_table(%rip), %r11 + lea .L__two_to_jby64_head_table(%rip), %r10 + mulsd (%rdx,%rax,8), %xmm0 + addsd (%r11,%rax,8), %xmm0 + addsd (%r10,%rax,8), %xmm0 + + jle .L__process_denormal +.L__process_normal: + shl $52, %rcx + movd %rcx,%xmm2 + paddq %xmm2, %xmm0 + ret + +.p2align 4 +.L__process_denormal: + jl .L__process_true_denormal + ucomisd .L__real_one(%rip), %xmm0 + jae .L__process_normal +.L__process_true_denormal: + # here ( e^r < 1 and m = -1022 ) or m <= -1023 + add $1074, %ecx + mov $1, %rax + shl %cl, %rax + movd %rax, %xmm2 + mulsd %xmm2, %xmm0 + ret + +.p2align 4 +.L__y_is_inf: + mov $0x7ff0000000000000,%rax + movd %rax, %xmm1 + mov $3, %edi + jmp fname_special + +.p2align 4 +.L__y_is_nan: + movapd %xmm0,%xmm1 + addsd %xmm0,%xmm1 + mov $1, %edi + jmp fname_special + +.p2align 4 +.L__y_is_zero: + ucomisd .L__min_exp_arg(%rip),%xmm0 + jbe .L__return_zero + movapd .L__real_smallest_denormal(%rip), %xmm0 + ret + +.p2align 4 +.L__return_zero: + pxor %xmm1,%xmm1 + mov $2, %edi + jmp fname_special + +.data +.align 16 +.L__max_exp_arg: .quad 0x40862e42fefa39ef +.L__denormal_tiny_threshold: .quad 0xc0874046dfefd9d0 +.L__min_exp_arg: .quad 0xc0874910d52d3051 +.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2) + +.align 16 +.L__log2_by_64_mhead: .quad 0xbf862e42fefa0000 +.L__log2_by_64_mtail: .quad 0xbd1cf79abc9e3b39 +.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720 +.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120 +.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6 +.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2 +.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24 +.L__real_one: .quad 0x3ff0000000000000 +.L__real_smallest_denormal: .quad 0x0000000000000001 + + +.align 16 +.L__two_to_jby64_table: + .quad 0x3ff0000000000000 + .quad 0x3ff02c9a3e778061 + .quad 0x3ff059b0d3158574 + .quad 0x3ff0874518759bc8 + .quad 0x3ff0b5586cf9890f + .quad 0x3ff0e3ec32d3d1a2 + .quad 0x3ff11301d0125b51 + .quad 0x3ff1429aaea92de0 + .quad 0x3ff172b83c7d517b + .quad 0x3ff1a35beb6fcb75 + .quad 0x3ff1d4873168b9aa + .quad 0x3ff2063b88628cd6 + .quad 0x3ff2387a6e756238 + .quad 0x3ff26b4565e27cdd + .quad 0x3ff29e9df51fdee1 + .quad 0x3ff2d285a6e4030b + .quad 0x3ff306fe0a31b715 + .quad 0x3ff33c08b26416ff + .quad 0x3ff371a7373aa9cb + .quad 0x3ff3a7db34e59ff7 + .quad 0x3ff3dea64c123422 + .quad 0x3ff4160a21f72e2a + .quad 0x3ff44e086061892d + .quad 0x3ff486a2b5c13cd0 + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff4f9b2769d2ca7 + .quad 0x3ff5342b569d4f82 + .quad 0x3ff56f4736b527da + .quad 0x3ff5ab07dd485429 + .quad 0x3ff5e76f15ad2148 + .quad 0x3ff6247eb03a5585 + .quad 0x3ff6623882552225 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff6dfb23c651a2f + .quad 0x3ff71f75e8ec5f74 + .quad 0x3ff75feb564267c9 + .quad 0x3ff7a11473eb0187 + .quad 0x3ff7e2f336cf4e62 + .quad 0x3ff82589994cce13 + .quad 0x3ff868d99b4492ed + .quad 0x3ff8ace5422aa0db + .quad 0x3ff8f1ae99157736 + .quad 0x3ff93737b0cdc5e5 + .quad 0x3ff97d829fde4e50 + .quad 0x3ff9c49182a3f090 + .quad 0x3ffa0c667b5de565 + .quad 0x3ffa5503b23e255d + .quad 0x3ffa9e6b5579fdbf + .quad 0x3ffae89f995ad3ad + .quad 0x3ffb33a2b84f15fb + .quad 0x3ffb7f76f2fb5e47 + .quad 0x3ffbcc1e904bc1d2 + .quad 0x3ffc199bdd85529c + .quad 0x3ffc67f12e57d14b + .quad 0x3ffcb720dcef9069 + .quad 0x3ffd072d4a07897c + .quad 0x3ffd5818dcfba487 + .quad 0x3ffda9e603db3285 + .quad 0x3ffdfc97337b9b5f + .quad 0x3ffe502ee78b3ff6 + .quad 0x3ffea4afa2a490da + .quad 0x3ffefa1bee615a27 + .quad 0x3fff50765b6e4540 + .quad 0x3fffa7c1819e90d8 + +.align 16 +.L__two_to_jby64_head_table: + .quad 0x3ff0000000000000 + 
.quad 0x3ff02c9a30000000 + .quad 0x3ff059b0d0000000 + .quad 0x3ff0874510000000 + .quad 0x3ff0b55860000000 + .quad 0x3ff0e3ec30000000 + .quad 0x3ff11301d0000000 + .quad 0x3ff1429aa0000000 + .quad 0x3ff172b830000000 + .quad 0x3ff1a35be0000000 + .quad 0x3ff1d48730000000 + .quad 0x3ff2063b80000000 + .quad 0x3ff2387a60000000 + .quad 0x3ff26b4560000000 + .quad 0x3ff29e9df0000000 + .quad 0x3ff2d285a0000000 + .quad 0x3ff306fe00000000 + .quad 0x3ff33c08b0000000 + .quad 0x3ff371a730000000 + .quad 0x3ff3a7db30000000 + .quad 0x3ff3dea640000000 + .quad 0x3ff4160a20000000 + .quad 0x3ff44e0860000000 + .quad 0x3ff486a2b0000000 + .quad 0x3ff4bfdad0000000 + .quad 0x3ff4f9b270000000 + .quad 0x3ff5342b50000000 + .quad 0x3ff56f4730000000 + .quad 0x3ff5ab07d0000000 + .quad 0x3ff5e76f10000000 + .quad 0x3ff6247eb0000000 + .quad 0x3ff6623880000000 + .quad 0x3ff6a09e60000000 + .quad 0x3ff6dfb230000000 + .quad 0x3ff71f75e0000000 + .quad 0x3ff75feb50000000 + .quad 0x3ff7a11470000000 + .quad 0x3ff7e2f330000000 + .quad 0x3ff8258990000000 + .quad 0x3ff868d990000000 + .quad 0x3ff8ace540000000 + .quad 0x3ff8f1ae90000000 + .quad 0x3ff93737b0000000 + .quad 0x3ff97d8290000000 + .quad 0x3ff9c49180000000 + .quad 0x3ffa0c6670000000 + .quad 0x3ffa5503b0000000 + .quad 0x3ffa9e6b50000000 + .quad 0x3ffae89f90000000 + .quad 0x3ffb33a2b0000000 + .quad 0x3ffb7f76f0000000 + .quad 0x3ffbcc1e90000000 + .quad 0x3ffc199bd0000000 + .quad 0x3ffc67f120000000 + .quad 0x3ffcb720d0000000 + .quad 0x3ffd072d40000000 + .quad 0x3ffd5818d0000000 + .quad 0x3ffda9e600000000 + .quad 0x3ffdfc9730000000 + .quad 0x3ffe502ee0000000 + .quad 0x3ffea4afa0000000 + .quad 0x3ffefa1be0000000 + .quad 0x3fff507650000000 + .quad 0x3fffa7c180000000 + +.align 16 +.L__two_to_jby64_tail_table: + .quad 0x0000000000000000 + .quad 0x3e6cef00c1dcdef9 + .quad 0x3e48ac2ba1d73e2a + .quad 0x3e60eb37901186be + .quad 0x3e69f3121ec53172 + .quad 0x3e469e8d10103a17 + .quad 0x3df25b50a4ebbf1a + .quad 0x3e6d525bbf668203 + .quad 0x3e68faa2f5b9bef9 + .quad 0x3e66df96ea796d31 + .quad 0x3e368b9aa7805b80 + .quad 0x3e60c519ac771dd6 + .quad 0x3e6ceac470cd83f5 + .quad 0x3e5789f37495e99c + .quad 0x3e547f7b84b09745 + .quad 0x3e5b900c2d002475 + .quad 0x3e64636e2a5bd1ab + .quad 0x3e4320b7fa64e430 + .quad 0x3e5ceaa72a9c5154 + .quad 0x3e53967fdba86f24 + .quad 0x3e682468446b6824 + .quad 0x3e3f72e29f84325b + .quad 0x3e18624b40c4dbd0 + .quad 0x3e5704f3404f068e + .quad 0x3e54d8a89c750e5e + .quad 0x3e5a74b29ab4cf62 + .quad 0x3e5a753e077c2a0f + .quad 0x3e5ad49f699bb2c0 + .quad 0x3e6a90a852b19260 + .quad 0x3e56b48521ba6f93 + .quad 0x3e0d2ac258f87d03 + .quad 0x3e42a91124893ecf + .quad 0x3e59fcef32422cbe + .quad 0x3e68ca345de441c5 + .quad 0x3e61d8bee7ba46e1 + .quad 0x3e59099f22fdba6a + .quad 0x3e4f580c36bea881 + .quad 0x3e5b3d398841740a + .quad 0x3e62999c25159f11 + .quad 0x3e668925d901c83b + .quad 0x3e415506dadd3e2a + .quad 0x3e622aee6c57304e + .quad 0x3e29b8bc9e8a0387 + .quad 0x3e6fbc9c9f173d24 + .quad 0x3e451f8480e3e235 + .quad 0x3e66bbcac96535b5 + .quad 0x3e41f12ae45a1224 + .quad 0x3e55e7f6fd0fac90 + .quad 0x3e62b5a75abd0e69 + .quad 0x3e609e2bf5ed7fa1 + .quad 0x3e47daf237553d84 + .quad 0x3e12f074891ee83d + .quad 0x3e6b0aa538444196 + .quad 0x3e6cafa29694426f + .quad 0x3e69df20d22a0797 + .quad 0x3e640f12f71a1e45 + .quad 0x3e69f7490e4bb40b + .quad 0x3e4ed9942b84600d + .quad 0x3e4bdcdaf5cb4656 + .quad 0x3e5e2cffd89cf44c + .quad 0x3e452486cc2c7b9d + .quad 0x3e6cc2b44eee3fa4 + .quad 0x3e66dc8a80ce9f09 + .quad 0x3e39e90d82e90a7e + +#endif
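
The algorithm comment at the top of exp.S maps to C roughly as follows. This is a minimal sketch, not the shipped code: exp_sketch and its locals are hypothetical names, exp2(j/64.0) stands in for the .L__two_to_jby64_table lookup (which the assembly additionally splits into head and tail halves), and the NaN/overflow/denormal paths are omitted. Note the code truncates n (cvttpd2dq) where the comment assumes rounding, so |r| can approach ln(2)/64 rather than ln(2)/128; the degree-6 polynomial appears to leave ample margin at either bound.

    #include <math.h>

    /* Minimal C sketch of the reduction documented in exp.S (hypothetical). */
    double exp_sketch(double x)
    {
        const double sixtyfour_by_ln2 = 0x1.71547652b82fep+6; /* 64/ln(2) */
        const double ln2_by_64        = 0x1.62e42fefa39efp-7; /* ln(2)/64 */

        int n = (int)(x * sixtyfour_by_ln2);   /* truncation, like cvttpd2dq */
        int j = n & 0x3f;                      /* table index */
        int m = n >> 6;                        /* power-of-two scale */
        double r = x - n * ln2_by_64;          /* reduced argument */

        /* q = e^r - 1: degree-6 Taylor polynomial in Horner form */
        double q = r + r * r * (1 / 2.0 + r * (1 / 6.0 + r * (1 / 24.0
                   + r * (1 / 120.0 + r * (1 / 720.0)))));

        double f = exp2(j / 64.0);             /* 2^(j/64) table stand-in */
        return ldexp(f + f * q, m);            /* (2^m) * (2^(j/64)) * e^r */
    }
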
diff --git a/src/gas/exp10.S b/src/gas/exp10.S new file mode 100644 index 0000000..009bbe0 --- /dev/null +++ b/src/gas/exp10.S
@@ -0,0 +1,366 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(exp10) +#define fname_special _exp10_special@PLT +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.p2align 4 +.globl fname +.type fname,@function +fname: + ucomisd .L__max_exp10_arg(%rip), %xmm0 + jae .L__y_is_inf + jp .L__y_is_nan + ucomisd .L__min_exp10_arg(%rip), %xmm0 + jbe .L__y_is_zero + + # x * (64/log10(2)) + movapd %xmm0,%xmm1 + mulsd .L__real_64_by_log10of2(%rip), %xmm1 + + # n = int( x * (64/log10(2)) ) + cvttpd2dq %xmm1, %xmm2 #xmm2 = (int)n + cvtdq2pd %xmm2, %xmm1 #xmm1 = (double)n + movd %xmm2, %ecx + movapd %xmm1,%xmm2 + # r1 = x - n * log10(2)/64 head + mulsd .L__log10of2_by_64_mhead(%rip),%xmm1 + + #j = n & 0x3f + mov $0x3f, %rax + and %ecx, %eax #eax = j + # m = (n - j) / 64 + sar $6, %ecx #ecx = m + + # r2 = - n * log10(2)/64 tail + mulsd .L__log10of2_by_64_mtail(%rip),%xmm2 #xmm2 = r2 + addsd %xmm1,%xmm0 #xmm0 = r1 + + # r1 *= ln10; + # r2 *= ln10; + mulsd .L__ln10(%rip),%xmm0 + mulsd .L__ln10(%rip),%xmm2 + + # r1+r2 + addsd %xmm0, %xmm2 #xmm2 = r + + # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720 + # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720))))) + movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720 + mulsd %xmm2, %xmm3 #xmm3 = r*1/720 + movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6 + movapd %xmm2, %xmm1 #xmm1 = r + mulsd %xmm2, %xmm0 #xmm0 = r*1/6 + addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720) + mulsd %xmm2, %xmm1 #xmm1 = r*r + addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6) + movapd %xmm1, %xmm4 #xmm4 = r*r + mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r) + mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720)) + mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6)) + addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720))) + addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720)))) + addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720)))) + + # (f)*(q) + f2 + f1 + cmp $0xfffffc02, %ecx # -1022 + lea .L__two_to_jby64_table(%rip), %rdx + lea .L__two_to_jby64_tail_table(%rip), %r11 + lea .L__two_to_jby64_head_table(%rip), %r10 + mulsd (%rdx,%rax,8), %xmm0 + addsd (%r11,%rax,8), %xmm0 + addsd (%r10,%rax,8), %xmm0 + + jle .L__process_denormal +.L__process_normal: + shl $52, %rcx + movd %rcx,%xmm2 + paddq %xmm2, %xmm0 + ret + +.p2align 4 +.L__process_denormal: + jl .L__process_true_denormal + ucomisd .L__real_one(%rip), %xmm0 + jae .L__process_normal +.L__process_true_denormal: + # here ( e^r < 1 and m = -1022 ) or m <= -1023 + add $1074, %ecx + mov $1, %rax + shl %cl, %rax + movd %rax, %xmm2 + mulsd %xmm2, 
%xmm0 + ret + +.p2align 4 +.L__y_is_inf: + mov $0x7ff0000000000000,%rax + movd %rax, %xmm1 + mov $3, %edi + #call fname_special + movdqa %xmm1,%xmm0 #remove this if call is made + ret + +.p2align 4 +.L__y_is_nan: + movapd %xmm0,%xmm1 + addsd %xmm0,%xmm1 + mov $1, %edi + #call fname_special + movdqa %xmm1,%xmm0 #remove this if call is made + ret + +.p2align 4 +.L__y_is_zero: + pxor %xmm1,%xmm1 + mov $2, %edi + #call fname_special + movdqa %xmm1,%xmm0 #remove this if call is made + ret + +.data +.align 16 +.L__max_exp10_arg: .quad 0x40734413509f79ff +.L__min_exp10_arg: .quad 0xc07434e6420f4374 +.L__real_64_by_log10of2: .quad 0x406A934F0979A371 # 64/log10(2) +.L__ln10: .quad 0x40026BB1BBB55516 + +.align 16 +.L__log10of2_by_64_mhead: .quad 0xbF73441350000000 +.L__log10of2_by_64_mtail: .quad 0xbda3ef3fde623e25 +.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720 +.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120 +.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6 +.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2 +.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24 +.L__real_one: .quad 0x3ff0000000000000 + +.align 16 +.L__two_to_jby64_table: + .quad 0x3ff0000000000000 + .quad 0x3ff02c9a3e778061 + .quad 0x3ff059b0d3158574 + .quad 0x3ff0874518759bc8 + .quad 0x3ff0b5586cf9890f + .quad 0x3ff0e3ec32d3d1a2 + .quad 0x3ff11301d0125b51 + .quad 0x3ff1429aaea92de0 + .quad 0x3ff172b83c7d517b + .quad 0x3ff1a35beb6fcb75 + .quad 0x3ff1d4873168b9aa + .quad 0x3ff2063b88628cd6 + .quad 0x3ff2387a6e756238 + .quad 0x3ff26b4565e27cdd + .quad 0x3ff29e9df51fdee1 + .quad 0x3ff2d285a6e4030b + .quad 0x3ff306fe0a31b715 + .quad 0x3ff33c08b26416ff + .quad 0x3ff371a7373aa9cb + .quad 0x3ff3a7db34e59ff7 + .quad 0x3ff3dea64c123422 + .quad 0x3ff4160a21f72e2a + .quad 0x3ff44e086061892d + .quad 0x3ff486a2b5c13cd0 + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff4f9b2769d2ca7 + .quad 0x3ff5342b569d4f82 + .quad 0x3ff56f4736b527da + .quad 0x3ff5ab07dd485429 + .quad 0x3ff5e76f15ad2148 + .quad 0x3ff6247eb03a5585 + .quad 0x3ff6623882552225 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff6dfb23c651a2f + .quad 0x3ff71f75e8ec5f74 + .quad 0x3ff75feb564267c9 + .quad 0x3ff7a11473eb0187 + .quad 0x3ff7e2f336cf4e62 + .quad 0x3ff82589994cce13 + .quad 0x3ff868d99b4492ed + .quad 0x3ff8ace5422aa0db + .quad 0x3ff8f1ae99157736 + .quad 0x3ff93737b0cdc5e5 + .quad 0x3ff97d829fde4e50 + .quad 0x3ff9c49182a3f090 + .quad 0x3ffa0c667b5de565 + .quad 0x3ffa5503b23e255d + .quad 0x3ffa9e6b5579fdbf + .quad 0x3ffae89f995ad3ad + .quad 0x3ffb33a2b84f15fb + .quad 0x3ffb7f76f2fb5e47 + .quad 0x3ffbcc1e904bc1d2 + .quad 0x3ffc199bdd85529c + .quad 0x3ffc67f12e57d14b + .quad 0x3ffcb720dcef9069 + .quad 0x3ffd072d4a07897c + .quad 0x3ffd5818dcfba487 + .quad 0x3ffda9e603db3285 + .quad 0x3ffdfc97337b9b5f + .quad 0x3ffe502ee78b3ff6 + .quad 0x3ffea4afa2a490da + .quad 0x3ffefa1bee615a27 + .quad 0x3fff50765b6e4540 + .quad 0x3fffa7c1819e90d8 + +.align 16 +.L__two_to_jby64_head_table: + .quad 0x3ff0000000000000 + .quad 0x3ff02c9a30000000 + .quad 0x3ff059b0d0000000 + .quad 0x3ff0874510000000 + .quad 0x3ff0b55860000000 + .quad 0x3ff0e3ec30000000 + .quad 0x3ff11301d0000000 + .quad 0x3ff1429aa0000000 + .quad 0x3ff172b830000000 + .quad 0x3ff1a35be0000000 + .quad 0x3ff1d48730000000 + .quad 0x3ff2063b80000000 + .quad 0x3ff2387a60000000 + .quad 0x3ff26b4560000000 + .quad 0x3ff29e9df0000000 + .quad 0x3ff2d285a0000000 + .quad 0x3ff306fe00000000 + .quad 0x3ff33c08b0000000 + .quad 0x3ff371a730000000 + .quad 0x3ff3a7db30000000 + .quad 0x3ff3dea640000000 + .quad 0x3ff4160a20000000 + .quad 0x3ff44e0860000000 + .quad 
0x3ff486a2b0000000 + .quad 0x3ff4bfdad0000000 + .quad 0x3ff4f9b270000000 + .quad 0x3ff5342b50000000 + .quad 0x3ff56f4730000000 + .quad 0x3ff5ab07d0000000 + .quad 0x3ff5e76f10000000 + .quad 0x3ff6247eb0000000 + .quad 0x3ff6623880000000 + .quad 0x3ff6a09e60000000 + .quad 0x3ff6dfb230000000 + .quad 0x3ff71f75e0000000 + .quad 0x3ff75feb50000000 + .quad 0x3ff7a11470000000 + .quad 0x3ff7e2f330000000 + .quad 0x3ff8258990000000 + .quad 0x3ff868d990000000 + .quad 0x3ff8ace540000000 + .quad 0x3ff8f1ae90000000 + .quad 0x3ff93737b0000000 + .quad 0x3ff97d8290000000 + .quad 0x3ff9c49180000000 + .quad 0x3ffa0c6670000000 + .quad 0x3ffa5503b0000000 + .quad 0x3ffa9e6b50000000 + .quad 0x3ffae89f90000000 + .quad 0x3ffb33a2b0000000 + .quad 0x3ffb7f76f0000000 + .quad 0x3ffbcc1e90000000 + .quad 0x3ffc199bd0000000 + .quad 0x3ffc67f120000000 + .quad 0x3ffcb720d0000000 + .quad 0x3ffd072d40000000 + .quad 0x3ffd5818d0000000 + .quad 0x3ffda9e600000000 + .quad 0x3ffdfc9730000000 + .quad 0x3ffe502ee0000000 + .quad 0x3ffea4afa0000000 + .quad 0x3ffefa1be0000000 + .quad 0x3fff507650000000 + .quad 0x3fffa7c180000000 + +.align 16 +.L__two_to_jby64_tail_table: + .quad 0x0000000000000000 + .quad 0x3e6cef00c1dcdef9 + .quad 0x3e48ac2ba1d73e2a + .quad 0x3e60eb37901186be + .quad 0x3e69f3121ec53172 + .quad 0x3e469e8d10103a17 + .quad 0x3df25b50a4ebbf1a + .quad 0x3e6d525bbf668203 + .quad 0x3e68faa2f5b9bef9 + .quad 0x3e66df96ea796d31 + .quad 0x3e368b9aa7805b80 + .quad 0x3e60c519ac771dd6 + .quad 0x3e6ceac470cd83f5 + .quad 0x3e5789f37495e99c + .quad 0x3e547f7b84b09745 + .quad 0x3e5b900c2d002475 + .quad 0x3e64636e2a5bd1ab + .quad 0x3e4320b7fa64e430 + .quad 0x3e5ceaa72a9c5154 + .quad 0x3e53967fdba86f24 + .quad 0x3e682468446b6824 + .quad 0x3e3f72e29f84325b + .quad 0x3e18624b40c4dbd0 + .quad 0x3e5704f3404f068e + .quad 0x3e54d8a89c750e5e + .quad 0x3e5a74b29ab4cf62 + .quad 0x3e5a753e077c2a0f + .quad 0x3e5ad49f699bb2c0 + .quad 0x3e6a90a852b19260 + .quad 0x3e56b48521ba6f93 + .quad 0x3e0d2ac258f87d03 + .quad 0x3e42a91124893ecf + .quad 0x3e59fcef32422cbe + .quad 0x3e68ca345de441c5 + .quad 0x3e61d8bee7ba46e1 + .quad 0x3e59099f22fdba6a + .quad 0x3e4f580c36bea881 + .quad 0x3e5b3d398841740a + .quad 0x3e62999c25159f11 + .quad 0x3e668925d901c83b + .quad 0x3e415506dadd3e2a + .quad 0x3e622aee6c57304e + .quad 0x3e29b8bc9e8a0387 + .quad 0x3e6fbc9c9f173d24 + .quad 0x3e451f8480e3e235 + .quad 0x3e66bbcac96535b5 + .quad 0x3e41f12ae45a1224 + .quad 0x3e55e7f6fd0fac90 + .quad 0x3e62b5a75abd0e69 + .quad 0x3e609e2bf5ed7fa1 + .quad 0x3e47daf237553d84 + .quad 0x3e12f074891ee83d + .quad 0x3e6b0aa538444196 + .quad 0x3e6cafa29694426f + .quad 0x3e69df20d22a0797 + .quad 0x3e640f12f71a1e45 + .quad 0x3e69f7490e4bb40b + .quad 0x3e4ed9942b84600d + .quad 0x3e4bdcdaf5cb4656 + .quad 0x3e5e2cffd89cf44c + .quad 0x3e452486cc2c7b9d + .quad 0x3e6cc2b44eee3fa4 + .quad 0x3e66dc8a80ce9f09 + .quad 0x3e39e90d82e90a7e + + +
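
exp10 above reuses this machinery with one twist, visible in its constants: n is formed with 64/log10(2), and the reduced argument is rescaled by ln(10) before the same e^r polynomial, since 10^x = 2^(n/64) * e^((x - n*log10(2)/64) * ln(10)). A hypothetical sketch of just that difference (table stand-in and omissions as in the exp sketch earlier):

    #include <math.h>

    /* Sketch of the exp10 reduction above (hypothetical, not the shipped code). */
    double exp10_sketch(double x)
    {
        const double sixtyfour_by_log10of2 = 0x1.a934f0979a371p+7; /* 64/log10(2) */
        const double log10of2_by_64        = 0x1.34413509f79ffp-8; /* log10(2)/64 */
        const double ln10                  = 0x1.26bb1bbb55516p+1; /* ln(10) */

        int n = (int)(x * sixtyfour_by_log10of2);
        double r = (x - n * log10of2_by_64) * ln10; /* 10^x = 2^(n/64) * e^r */

        double q = r + r * r * (1 / 2.0 + r * (1 / 6.0 + r * (1 / 24.0
                   + r * (1 / 120.0 + r * (1 / 720.0)))));
        double f = exp2((n & 0x3f) / 64.0);         /* 2^(j/64) table stand-in */
        return ldexp(f + f * q, n >> 6);
    }
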
diff --git a/src/gas/exp10f.S b/src/gas/exp10f.S new file mode 100644 index 0000000..da805e2 --- /dev/null +++ b/src/gas/exp10f.S
@@ -0,0 +1,191 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp10f)
+#define fname_special _exp10f_special@PLT
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+    ucomiss .L__max_exp_arg(%rip), %xmm0
+    ja .L__y_is_inf
+    jp .L__y_is_nan
+    ucomiss .L__min_exp_arg(%rip), %xmm0
+    jb .L__y_is_zero
+
+    cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+    # x * (64/log10(2))
+    movapd %xmm0,%xmm3 #xmm3 = (double)x
+    mulsd .L__real_64_by_log10of2(%rip), %xmm3 #xmm3 = x * (64/log10(2))
+
+    # n = int( x * (64/log10(2)) )
+    cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+    cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+    # r = x - n * log10(2)/64
+    # r *= ln(10)
+    mulsd .L__real_log10of2_by_64(%rip),%xmm2 #xmm2 = n * log10(2)/64
+    movd %xmm4, %ecx #ecx = n
+    subsd %xmm2, %xmm0 #xmm0 = r
+    mulsd .L__real_ln10(%rip),%xmm0 #xmm0 = r = r*ln10
+    movapd %xmm0, %xmm1 #xmm1 = r
+
+    # q = r + r*r*(1/2 + r*(1/6))
+    movapd .L__real_1_by_6(%rip), %xmm3
+    mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+    mulsd %xmm1, %xmm0 #xmm0 = r * r
+    addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+    mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+    addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+    #j = n & 0x3f
+    mov $0x3f, %rax #rax = 0x3f
+    and %ecx, %eax #eax = j = n & 0x3f
+
+    # f + (f*q)
+    lea L__two_to_jby64_table(%rip), %r10
+    mulsd (%r10,%rax,8), %xmm0
+    addsd (%r10,%rax,8), %xmm0
+
+    .p2align 4
+    # m = (n - j) / 64
+    psrad $6,%xmm4
+    psllq $52,%xmm4
+    paddq %xmm0, %xmm4
+    cvtpd2ps %xmm4, %xmm0
+    ret
+
+.p2align 4
+.L__y_is_zero:
+    pxor %xmm1, %xmm1 #return value in xmm1, input in xmm0 before calling
+    mov $2, %edi #code in edi
+    #call fname_special
+    pxor %xmm0,%xmm0 #remove this if calling fname_special
+    ret
+
+.p2align 4
+.L__y_is_inf:
+    mov $0x7f800000,%edx
+    movd %edx, %xmm1
+    mov $3, %edi
+    #call fname_special
+    movdqa %xmm1,%xmm0 #remove this if calling fname_special
+    ret
+
+.p2align 4
+.L__y_is_nan:
+    movaps %xmm0,%xmm1
+    addss %xmm1,%xmm1
+    mov $1, %edi
+    #call fname_special
+    movdqa %xmm1,%xmm0 #remove this if calling fname_special
+    ret
+
+.data
+.align 16
+.L__max_exp_arg: .long 0x421A209B
+.L__min_exp_arg: .long 0xC23369F4
+.L__real_64_by_log10of2: .quad 0x406A934F0979A371 # 64/log10(2)
+.L__real_log10of2_by_64: .quad 0x3F734413509F79FF # log10(2)/64
+.L__real_ln10: .quad 0x40026BB1BBB55516 # ln(10)
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+    .quad 0x3ff0000000000000
+    .quad 0x3ff02c9a3e778061
+    .quad 0x3ff059b0d3158574
+    .quad 0x3ff0874518759bc8
+    .quad 0x3ff0b5586cf9890f
+    .quad 0x3ff0e3ec32d3d1a2
+
.quad 0x3ff11301d0125b51 + .quad 0x3ff1429aaea92de0 + .quad 0x3ff172b83c7d517b + .quad 0x3ff1a35beb6fcb75 + .quad 0x3ff1d4873168b9aa + .quad 0x3ff2063b88628cd6 + .quad 0x3ff2387a6e756238 + .quad 0x3ff26b4565e27cdd + .quad 0x3ff29e9df51fdee1 + .quad 0x3ff2d285a6e4030b + .quad 0x3ff306fe0a31b715 + .quad 0x3ff33c08b26416ff + .quad 0x3ff371a7373aa9cb + .quad 0x3ff3a7db34e59ff7 + .quad 0x3ff3dea64c123422 + .quad 0x3ff4160a21f72e2a + .quad 0x3ff44e086061892d + .quad 0x3ff486a2b5c13cd0 + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff4f9b2769d2ca7 + .quad 0x3ff5342b569d4f82 + .quad 0x3ff56f4736b527da + .quad 0x3ff5ab07dd485429 + .quad 0x3ff5e76f15ad2148 + .quad 0x3ff6247eb03a5585 + .quad 0x3ff6623882552225 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff6dfb23c651a2f + .quad 0x3ff71f75e8ec5f74 + .quad 0x3ff75feb564267c9 + .quad 0x3ff7a11473eb0187 + .quad 0x3ff7e2f336cf4e62 + .quad 0x3ff82589994cce13 + .quad 0x3ff868d99b4492ed + .quad 0x3ff8ace5422aa0db + .quad 0x3ff8f1ae99157736 + .quad 0x3ff93737b0cdc5e5 + .quad 0x3ff97d829fde4e50 + .quad 0x3ff9c49182a3f090 + .quad 0x3ffa0c667b5de565 + .quad 0x3ffa5503b23e255d + .quad 0x3ffa9e6b5579fdbf + .quad 0x3ffae89f995ad3ad + .quad 0x3ffb33a2b84f15fb + .quad 0x3ffb7f76f2fb5e47 + .quad 0x3ffbcc1e904bc1d2 + .quad 0x3ffc199bdd85529c + .quad 0x3ffc67f12e57d14b + .quad 0x3ffcb720dcef9069 + .quad 0x3ffd072d4a07897c + .quad 0x3ffd5818dcfba487 + .quad 0x3ffda9e603db3285 + .quad 0x3ffdfc97337b9b5f + .quad 0x3ffe502ee78b3ff6 + .quad 0x3ffea4afa2a490da + .quad 0x3ffefa1bee615a27 + .quad 0x3fff50765b6e4540 + .quad 0x3fffa7c1819e90d8 + +
diff --git a/src/gas/exp2.S b/src/gas/exp2.S new file mode 100644 index 0000000..8e556d4 --- /dev/null +++ b/src/gas/exp2.S
@@ -0,0 +1,355 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(exp2) +#define fname_special _exp2_special@PLT +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.p2align 4 +.globl fname +.type fname,@function +fname: + ucomisd .L__max_exp2_arg(%rip), %xmm0 + ja .L__y_is_inf + jp .L__y_is_nan + ucomisd .L__min_exp2_arg(%rip), %xmm0 + jbe .L__y_is_zero + + # x * (64) + movapd %xmm0,%xmm2 + mulsd .L__real_64(%rip), %xmm2 + + # n = int( x * (64)) + cvttpd2dq %xmm2, %xmm1 #xmm1 = (int)n + cvtdq2pd %xmm1, %xmm2 #xmm2 = (double)n + movd %xmm1, %ecx + + # r = x - n * 1/64 + #r *= ln2; + mulsd .L__one_by_64(%rip),%xmm2 + addsd %xmm0,%xmm2 #xmm2 = r + mulsd .L__ln_2(%rip),%xmm2 + + #j = n & 0x3f + mov $0x3f, %rax + and %ecx, %eax #eax = j + # m = (n - j) / 64 + sar $6, %ecx #ecx = m + + # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720 + # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720))))) + movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720 + mulsd %xmm2, %xmm3 #xmm3 = r*1/720 + movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6 + movapd %xmm2, %xmm1 #xmm1 = r + mulsd %xmm2, %xmm0 #xmm0 = r*1/6 + addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720) + mulsd %xmm2, %xmm1 #xmm1 = r*r + addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6) + movapd %xmm1, %xmm4 #xmm4 = r*r + mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r) + mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720)) + mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6)) + addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720))) + addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720)))) + addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720)))) + + # (f)*(q) + f2 + f1 + cmp $0xfffffc02, %ecx # -1022 + lea .L__two_to_jby64_table(%rip), %rdx + lea .L__two_to_jby64_tail_table(%rip), %r11 + lea .L__two_to_jby64_head_table(%rip), %r10 + mulsd (%rdx,%rax,8), %xmm0 + addsd (%r11,%rax,8), %xmm0 + addsd (%r10,%rax,8), %xmm0 + + jle .L__process_denormal +.L__process_normal: + shl $52, %rcx + movd %rcx,%xmm2 + paddq %xmm2, %xmm0 + ret + +.p2align 4 +.L__process_denormal: + jl .L__process_true_denormal + ucomisd .L__real_one(%rip), %xmm0 + jae .L__process_normal +.L__process_true_denormal: + # here ( e^r < 1 and m = -1022 ) or m <= -1023 + add $1074, %ecx + mov $1, %rax + shl %cl, %rax + movd %rax, %xmm2 + mulsd %xmm2, %xmm0 + ret + +.p2align 4 +.L__y_is_inf: + mov $0x7ff0000000000000,%rax + movd %rax, %xmm1 + mov $3, %edi + #call fname_special + movdqa %xmm1,%xmm0 #remove this if call is made + ret + +.p2align 4 +.L__y_is_nan: + movapd %xmm0,%xmm1 + addsd %xmm0,%xmm1 + mov $1, 
%edi + #call fname_special + movdqa %xmm1,%xmm0 #remove this if call is made + ret + +.p2align 4 +.L__y_is_zero: + pxor %xmm1,%xmm1 + mov $2, %edi + #call fname_special + movdqa %xmm1,%xmm0 #remove this if call is made + ret + +.data +.align 16 +.L__max_exp2_arg: .quad 0x4090000000000000 +.L__min_exp2_arg: .quad 0xc090c80000000000 +.L__real_64: .quad 0x4050000000000000 # 64 +.L__ln_2: .quad 0x3FE62E42FEFA39EF +.L__one_by_64: .quad 0xbF90000000000000 + +.align 16 +.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720 +.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120 +.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6 +.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2 +.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24 +.L__real_one: .quad 0x3ff0000000000000 + +.align 16 +.L__two_to_jby64_table: + .quad 0x3ff0000000000000 + .quad 0x3ff02c9a3e778061 + .quad 0x3ff059b0d3158574 + .quad 0x3ff0874518759bc8 + .quad 0x3ff0b5586cf9890f + .quad 0x3ff0e3ec32d3d1a2 + .quad 0x3ff11301d0125b51 + .quad 0x3ff1429aaea92de0 + .quad 0x3ff172b83c7d517b + .quad 0x3ff1a35beb6fcb75 + .quad 0x3ff1d4873168b9aa + .quad 0x3ff2063b88628cd6 + .quad 0x3ff2387a6e756238 + .quad 0x3ff26b4565e27cdd + .quad 0x3ff29e9df51fdee1 + .quad 0x3ff2d285a6e4030b + .quad 0x3ff306fe0a31b715 + .quad 0x3ff33c08b26416ff + .quad 0x3ff371a7373aa9cb + .quad 0x3ff3a7db34e59ff7 + .quad 0x3ff3dea64c123422 + .quad 0x3ff4160a21f72e2a + .quad 0x3ff44e086061892d + .quad 0x3ff486a2b5c13cd0 + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff4f9b2769d2ca7 + .quad 0x3ff5342b569d4f82 + .quad 0x3ff56f4736b527da + .quad 0x3ff5ab07dd485429 + .quad 0x3ff5e76f15ad2148 + .quad 0x3ff6247eb03a5585 + .quad 0x3ff6623882552225 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff6dfb23c651a2f + .quad 0x3ff71f75e8ec5f74 + .quad 0x3ff75feb564267c9 + .quad 0x3ff7a11473eb0187 + .quad 0x3ff7e2f336cf4e62 + .quad 0x3ff82589994cce13 + .quad 0x3ff868d99b4492ed + .quad 0x3ff8ace5422aa0db + .quad 0x3ff8f1ae99157736 + .quad 0x3ff93737b0cdc5e5 + .quad 0x3ff97d829fde4e50 + .quad 0x3ff9c49182a3f090 + .quad 0x3ffa0c667b5de565 + .quad 0x3ffa5503b23e255d + .quad 0x3ffa9e6b5579fdbf + .quad 0x3ffae89f995ad3ad + .quad 0x3ffb33a2b84f15fb + .quad 0x3ffb7f76f2fb5e47 + .quad 0x3ffbcc1e904bc1d2 + .quad 0x3ffc199bdd85529c + .quad 0x3ffc67f12e57d14b + .quad 0x3ffcb720dcef9069 + .quad 0x3ffd072d4a07897c + .quad 0x3ffd5818dcfba487 + .quad 0x3ffda9e603db3285 + .quad 0x3ffdfc97337b9b5f + .quad 0x3ffe502ee78b3ff6 + .quad 0x3ffea4afa2a490da + .quad 0x3ffefa1bee615a27 + .quad 0x3fff50765b6e4540 + .quad 0x3fffa7c1819e90d8 + +.align 16 +.L__two_to_jby64_head_table: + .quad 0x3ff0000000000000 + .quad 0x3ff02c9a30000000 + .quad 0x3ff059b0d0000000 + .quad 0x3ff0874510000000 + .quad 0x3ff0b55860000000 + .quad 0x3ff0e3ec30000000 + .quad 0x3ff11301d0000000 + .quad 0x3ff1429aa0000000 + .quad 0x3ff172b830000000 + .quad 0x3ff1a35be0000000 + .quad 0x3ff1d48730000000 + .quad 0x3ff2063b80000000 + .quad 0x3ff2387a60000000 + .quad 0x3ff26b4560000000 + .quad 0x3ff29e9df0000000 + .quad 0x3ff2d285a0000000 + .quad 0x3ff306fe00000000 + .quad 0x3ff33c08b0000000 + .quad 0x3ff371a730000000 + .quad 0x3ff3a7db30000000 + .quad 0x3ff3dea640000000 + .quad 0x3ff4160a20000000 + .quad 0x3ff44e0860000000 + .quad 0x3ff486a2b0000000 + .quad 0x3ff4bfdad0000000 + .quad 0x3ff4f9b270000000 + .quad 0x3ff5342b50000000 + .quad 0x3ff56f4730000000 + .quad 0x3ff5ab07d0000000 + .quad 0x3ff5e76f10000000 + .quad 0x3ff6247eb0000000 + .quad 0x3ff6623880000000 + .quad 0x3ff6a09e60000000 + .quad 0x3ff6dfb230000000 + .quad 0x3ff71f75e0000000 + .quad 0x3ff75feb50000000 + .quad 
0x3ff7a11470000000 + .quad 0x3ff7e2f330000000 + .quad 0x3ff8258990000000 + .quad 0x3ff868d990000000 + .quad 0x3ff8ace540000000 + .quad 0x3ff8f1ae90000000 + .quad 0x3ff93737b0000000 + .quad 0x3ff97d8290000000 + .quad 0x3ff9c49180000000 + .quad 0x3ffa0c6670000000 + .quad 0x3ffa5503b0000000 + .quad 0x3ffa9e6b50000000 + .quad 0x3ffae89f90000000 + .quad 0x3ffb33a2b0000000 + .quad 0x3ffb7f76f0000000 + .quad 0x3ffbcc1e90000000 + .quad 0x3ffc199bd0000000 + .quad 0x3ffc67f120000000 + .quad 0x3ffcb720d0000000 + .quad 0x3ffd072d40000000 + .quad 0x3ffd5818d0000000 + .quad 0x3ffda9e600000000 + .quad 0x3ffdfc9730000000 + .quad 0x3ffe502ee0000000 + .quad 0x3ffea4afa0000000 + .quad 0x3ffefa1be0000000 + .quad 0x3fff507650000000 + .quad 0x3fffa7c180000000 + +.align 16 +.L__two_to_jby64_tail_table: + .quad 0x0000000000000000 + .quad 0x3e6cef00c1dcdef9 + .quad 0x3e48ac2ba1d73e2a + .quad 0x3e60eb37901186be + .quad 0x3e69f3121ec53172 + .quad 0x3e469e8d10103a17 + .quad 0x3df25b50a4ebbf1a + .quad 0x3e6d525bbf668203 + .quad 0x3e68faa2f5b9bef9 + .quad 0x3e66df96ea796d31 + .quad 0x3e368b9aa7805b80 + .quad 0x3e60c519ac771dd6 + .quad 0x3e6ceac470cd83f5 + .quad 0x3e5789f37495e99c + .quad 0x3e547f7b84b09745 + .quad 0x3e5b900c2d002475 + .quad 0x3e64636e2a5bd1ab + .quad 0x3e4320b7fa64e430 + .quad 0x3e5ceaa72a9c5154 + .quad 0x3e53967fdba86f24 + .quad 0x3e682468446b6824 + .quad 0x3e3f72e29f84325b + .quad 0x3e18624b40c4dbd0 + .quad 0x3e5704f3404f068e + .quad 0x3e54d8a89c750e5e + .quad 0x3e5a74b29ab4cf62 + .quad 0x3e5a753e077c2a0f + .quad 0x3e5ad49f699bb2c0 + .quad 0x3e6a90a852b19260 + .quad 0x3e56b48521ba6f93 + .quad 0x3e0d2ac258f87d03 + .quad 0x3e42a91124893ecf + .quad 0x3e59fcef32422cbe + .quad 0x3e68ca345de441c5 + .quad 0x3e61d8bee7ba46e1 + .quad 0x3e59099f22fdba6a + .quad 0x3e4f580c36bea881 + .quad 0x3e5b3d398841740a + .quad 0x3e62999c25159f11 + .quad 0x3e668925d901c83b + .quad 0x3e415506dadd3e2a + .quad 0x3e622aee6c57304e + .quad 0x3e29b8bc9e8a0387 + .quad 0x3e6fbc9c9f173d24 + .quad 0x3e451f8480e3e235 + .quad 0x3e66bbcac96535b5 + .quad 0x3e41f12ae45a1224 + .quad 0x3e55e7f6fd0fac90 + .quad 0x3e62b5a75abd0e69 + .quad 0x3e609e2bf5ed7fa1 + .quad 0x3e47daf237553d84 + .quad 0x3e12f074891ee83d + .quad 0x3e6b0aa538444196 + .quad 0x3e6cafa29694426f + .quad 0x3e69df20d22a0797 + .quad 0x3e640f12f71a1e45 + .quad 0x3e69f7490e4bb40b + .quad 0x3e4ed9942b84600d + .quad 0x3e4bdcdaf5cb4656 + .quad 0x3e5e2cffd89cf44c + .quad 0x3e452486cc2c7b9d + .quad 0x3e6cc2b44eee3fa4 + .quad 0x3e66dc8a80ce9f09 + .quad 0x3e39e90d82e90a7e + +
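
exp2 above needs no transcendental constant on the way in: n = int(x * 64) and x - n/64 involve only powers of two, so the only inexact step of the reduction is scaling r by ln(2) for the shared e^r polynomial. A hypothetical sketch:

    #include <math.h>

    /* Sketch of the exp2 reduction above (hypothetical, not the shipped code). */
    double exp2_sketch(double x)
    {
        const double ln2 = 0x1.62e42fefa39efp-1;    /* ln(2) */

        int n = (int)(x * 64.0);
        double r = (x - n / 64.0) * ln2;            /* 2^x = 2^(n/64) * e^r */

        double q = r + r * r * (1 / 2.0 + r * (1 / 6.0 + r * (1 / 24.0
                   + r * (1 / 120.0 + r * (1 / 720.0)))));
        double f = exp2((n & 0x3f) / 64.0);         /* 2^(j/64) table stand-in */
        return ldexp(f + f * q, n >> 6);
    }
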
diff --git a/src/gas/exp2f.S b/src/gas/exp2f.S new file mode 100644 index 0000000..78c50e0 --- /dev/null +++ b/src/gas/exp2f.S
@@ -0,0 +1,193 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp2f)
+#define fname_special _exp2f_special@PLT
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+    ucomiss .L__max_exp2_arg(%rip), %xmm0
+    ja .L__y_is_inf
+    jp .L__y_is_nan
+    ucomiss .L__min_exp2_arg(%rip), %xmm0
+    jb .L__y_is_zero
+
+    cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+    # x * 64
+    movapd %xmm0,%xmm3 #xmm3 = (double)x
+    #mulsd .L__sixtyfour(%rip), %xmm3 #xmm3 = x * 64
+    paddq .L__sixtyfour(%rip), %xmm3 #xmm3 = x * 64 (adds 6 to the exponent field)
+
+    # n = int( x * 64 )
+    cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+    cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+    # r = x - n * 1/64
+    # r *= ln(2)
+    mulsd .L__one_by_64(%rip),%xmm2 #xmm2 = n * 1/64
+    movd %xmm4, %ecx #ecx = n
+    subsd %xmm2, %xmm0 #xmm0 = r
+    mulsd .L__ln2(%rip),%xmm0 #xmm0 = r = r*ln(2)
+    movapd %xmm0, %xmm1 #xmm1 = r
+
+    # q
+    movsd .L__real_1_by_6(%rip), %xmm3
+    mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+    mulsd %xmm1, %xmm0 #xmm0 = r * r
+    addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+    mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+    addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+    #j = n & 0x3f
+    mov $0x3f, %rax #rax = 0x3f
+    and %ecx, %eax #eax = j = n & 0x3f
+
+    # f + (f*q)
+    lea L__two_to_jby64_table(%rip), %r10
+    mulsd (%r10,%rax,8), %xmm0
+    addsd (%r10,%rax,8), %xmm0
+
+    .p2align 4
+    # m = (n - j) / 64
+    psrad $6,%xmm4
+    psllq $52,%xmm4
+    paddq %xmm0, %xmm4
+    cvtpd2ps %xmm4, %xmm0
+    ret
+
+.p2align 4
+.L__y_is_zero:
+    pxor %xmm1, %xmm1 #return value in xmm1, input in xmm0 before calling
+    mov $2, %edi #code in edi
+    #call fname_special
+    pxor %xmm0,%xmm0 #remove this if calling fname_special
+    ret
+
+.p2align 4
+.L__y_is_inf:
+    mov $0x7f800000,%edx
+    movd %edx, %xmm1
+    mov $3, %edi
+    #call fname_special
+    movdqa %xmm1,%xmm0 #remove this if calling fname_special
+    ret
+
+.p2align 4
+.L__y_is_nan:
+    movaps %xmm0,%xmm1
+    addss %xmm1,%xmm1
+    mov $1, %edi
+    #call fname_special
+    movdqa %xmm1,%xmm0 #remove this if calling fname_special
+    ret
+
+.data
+.align 16
+.L__max_exp2_arg: .long 0x43000000
+.L__min_exp2_arg: .long 0xc3150000
+.align 16
+.L__sixtyfour: .quad 0x0060000000000000 # 6 << 52: paddq with this adds 6 to the exponent, i.e. multiplies by 64 (not the double value 64.0)
+.L__one_by_64: .quad 0x3F90000000000000 # 1/64
+.L__ln2: .quad 0x3FE62E42FEFA39EF # ln(2)
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+    .quad 0x3ff0000000000000
+    .quad 0x3ff02c9a3e778061
+    .quad 0x3ff059b0d3158574
+    .quad 0x3ff0874518759bc8
+    .quad 0x3ff0b5586cf9890f
+    .quad 0x3ff0e3ec32d3d1a2
+    .quad 0x3ff11301d0125b51
+    .quad 0x3ff1429aaea92de0
+    .quad
0x3ff172b83c7d517b + .quad 0x3ff1a35beb6fcb75 + .quad 0x3ff1d4873168b9aa + .quad 0x3ff2063b88628cd6 + .quad 0x3ff2387a6e756238 + .quad 0x3ff26b4565e27cdd + .quad 0x3ff29e9df51fdee1 + .quad 0x3ff2d285a6e4030b + .quad 0x3ff306fe0a31b715 + .quad 0x3ff33c08b26416ff + .quad 0x3ff371a7373aa9cb + .quad 0x3ff3a7db34e59ff7 + .quad 0x3ff3dea64c123422 + .quad 0x3ff4160a21f72e2a + .quad 0x3ff44e086061892d + .quad 0x3ff486a2b5c13cd0 + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff4f9b2769d2ca7 + .quad 0x3ff5342b569d4f82 + .quad 0x3ff56f4736b527da + .quad 0x3ff5ab07dd485429 + .quad 0x3ff5e76f15ad2148 + .quad 0x3ff6247eb03a5585 + .quad 0x3ff6623882552225 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff6dfb23c651a2f + .quad 0x3ff71f75e8ec5f74 + .quad 0x3ff75feb564267c9 + .quad 0x3ff7a11473eb0187 + .quad 0x3ff7e2f336cf4e62 + .quad 0x3ff82589994cce13 + .quad 0x3ff868d99b4492ed + .quad 0x3ff8ace5422aa0db + .quad 0x3ff8f1ae99157736 + .quad 0x3ff93737b0cdc5e5 + .quad 0x3ff97d829fde4e50 + .quad 0x3ff9c49182a3f090 + .quad 0x3ffa0c667b5de565 + .quad 0x3ffa5503b23e255d + .quad 0x3ffa9e6b5579fdbf + .quad 0x3ffae89f995ad3ad + .quad 0x3ffb33a2b84f15fb + .quad 0x3ffb7f76f2fb5e47 + .quad 0x3ffbcc1e904bc1d2 + .quad 0x3ffc199bdd85529c + .quad 0x3ffc67f12e57d14b + .quad 0x3ffcb720dcef9069 + .quad 0x3ffd072d4a07897c + .quad 0x3ffd5818dcfba487 + .quad 0x3ffda9e603db3285 + .quad 0x3ffdfc97337b9b5f + .quad 0x3ffe502ee78b3ff6 + .quad 0x3ffea4afa2a490da + .quad 0x3ffefa1bee615a27 + .quad 0x3fff50765b6e4540 + .quad 0x3fffa7c1819e90d8 + +
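
One detail worth flagging in exp2f above: x * 64 is not computed with mulsd (that line is commented out) but with paddq against .L__sixtyfour = 0x0060000000000000, i.e. 6 << 52, which adds 6 to the double's biased exponent. A hypothetical C illustration of the same bit trick:

    #include <stdint.h>
    #include <string.h>

    /* Adding k to the exponent field of a normal, nonzero double scales it
     * by 2^k -- here k = 6, i.e. multiply by 64 (hypothetical illustration).
     * exp2f's range checks keep inputs where the trick is safe; even x == 0
     * only becomes a tiny value that still truncates to n = 0. */
    static double scale_by_64(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);   /* the movapd/paddq view of x */
        bits += (uint64_t)6 << 52;        /* exponent += 6  =>  x *= 2^6 */
        memcpy(&x, &bits, sizeof bits);
        return x;
    }
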
diff --git a/src/gas/expf.S b/src/gas/expf.S new file mode 100644 index 0000000..cefa608 --- /dev/null +++ b/src/gas/expf.S
@@ -0,0 +1,201 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# expf.S
+#
+# An implementation of the expf libm function.
+#
+# Prototype:
+#
+# float expf(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in exp.S
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(expf)
+#define fname_special _expf_special@PLT
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+    ucomiss .L__max_exp_arg(%rip), %xmm0
+    ja .L__y_is_inf
+    jp .L__y_is_nan
+    ucomiss .L__min_exp_arg(%rip), %xmm0
+    jb .L__y_is_zero
+
+    cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+    # x * (64/ln(2))
+    movapd %xmm0,%xmm3 #xmm3 = (double)x
+    mulsd .L__real_64_by_log2(%rip), %xmm3 #xmm3 = x * (64/ln(2))
+
+    # n = int( x * (64/ln(2)) )
+    cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+    cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+    # r = x - n * ln(2)/64
+    mulsd .L__real_log2_by_64(%rip),%xmm2 #xmm2 = n * ln(2)/64
+    movd %xmm4, %ecx #ecx = n
+    subsd %xmm2, %xmm0 #xmm0 = r
+    movapd %xmm0, %xmm1 #xmm1 = r
+
+    # q
+    movsd .L__real_1_by_6(%rip), %xmm3
+    mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+    mulsd %xmm1, %xmm0 #xmm0 = r * r
+    addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+    mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+    addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+    #j = n & 0x3f
+    mov $0x3f, %rax #rax = 0x3f
+    and %ecx, %eax #eax = j = n & 0x3f
+    # m = (n - j) / 64
+    sar $6, %ecx #ecx = m
+    shl $52, %rcx
+
+    # (f)*(1+q)
+    lea L__two_to_jby64_table(%rip), %r10
+    movsd (%r10,%rax,8), %xmm2
+    mulsd %xmm2, %xmm0
+    addsd %xmm2, %xmm0
+
+    movd %rcx, %xmm1
+    paddq %xmm0, %xmm1
+    cvtpd2ps %xmm1, %xmm0
+    ret
+
+.p2align 4
+.L__y_is_zero:
+
+    pxor %xmm1, %xmm1 #return value in xmm1, input in xmm0 before calling
+    mov $2, %edi #code in edi
+    jmp fname_special
+
+.p2align 4
+.L__y_is_inf:
+
+    mov $0x7f800000,%edx
+    movd %edx, %xmm1
+    mov $3, %edi
+    jmp fname_special
+
+.p2align 4
+.L__y_is_nan:
+    movaps %xmm0,%xmm1
+    addss %xmm1,%xmm1
+    mov $1, %edi
+    jmp fname_special
+
+.data
+.align 16
+.L__max_exp_arg: .long 0x42B17218
+.L__min_exp_arg: .long 0xC2CE8ED0
+.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2)
+.L__real_log2_by_64: .quad 0x3f862e42fefa39ef # ln(2)/64
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+    .quad 0x3ff0000000000000
+    .quad 0x3ff02c9a3e778061
+    .quad 0x3ff059b0d3158574
+    .quad 0x3ff0874518759bc8
+    .quad 0x3ff0b5586cf9890f
+    .quad 0x3ff0e3ec32d3d1a2
+    .quad 0x3ff11301d0125b51
+    .quad 0x3ff1429aaea92de0
+    .quad 0x3ff172b83c7d517b
+    .quad 0x3ff1a35beb6fcb75
+    .quad 0x3ff1d4873168b9aa
+    .quad 0x3ff2063b88628cd6
+    .quad
0x3ff2387a6e756238 + .quad 0x3ff26b4565e27cdd + .quad 0x3ff29e9df51fdee1 + .quad 0x3ff2d285a6e4030b + .quad 0x3ff306fe0a31b715 + .quad 0x3ff33c08b26416ff + .quad 0x3ff371a7373aa9cb + .quad 0x3ff3a7db34e59ff7 + .quad 0x3ff3dea64c123422 + .quad 0x3ff4160a21f72e2a + .quad 0x3ff44e086061892d + .quad 0x3ff486a2b5c13cd0 + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff4f9b2769d2ca7 + .quad 0x3ff5342b569d4f82 + .quad 0x3ff56f4736b527da + .quad 0x3ff5ab07dd485429 + .quad 0x3ff5e76f15ad2148 + .quad 0x3ff6247eb03a5585 + .quad 0x3ff6623882552225 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff6dfb23c651a2f + .quad 0x3ff71f75e8ec5f74 + .quad 0x3ff75feb564267c9 + .quad 0x3ff7a11473eb0187 + .quad 0x3ff7e2f336cf4e62 + .quad 0x3ff82589994cce13 + .quad 0x3ff868d99b4492ed + .quad 0x3ff8ace5422aa0db + .quad 0x3ff8f1ae99157736 + .quad 0x3ff93737b0cdc5e5 + .quad 0x3ff97d829fde4e50 + .quad 0x3ff9c49182a3f090 + .quad 0x3ffa0c667b5de565 + .quad 0x3ffa5503b23e255d + .quad 0x3ffa9e6b5579fdbf + .quad 0x3ffae89f995ad3ad + .quad 0x3ffb33a2b84f15fb + .quad 0x3ffb7f76f2fb5e47 + .quad 0x3ffbcc1e904bc1d2 + .quad 0x3ffc199bdd85529c + .quad 0x3ffc67f12e57d14b + .quad 0x3ffcb720dcef9069 + .quad 0x3ffd072d4a07897c + .quad 0x3ffd5818dcfba487 + .quad 0x3ffda9e603db3285 + .quad 0x3ffdfc97337b9b5f + .quad 0x3ffe502ee78b3ff6 + .quad 0x3ffea4afa2a490da + .quad 0x3ffefa1bee615a27 + .quad 0x3fff50765b6e4540 + .quad 0x3fffa7c1819e90d8 + + +#endif
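
The single-precision routines (expf above; exp10f and exp2f earlier) widen to double, use a degree-3 polynomial, read a single 2^(j/64) table, and apply 2^m by adding m << 52 straight into the double's exponent field (the shl $52 / paddq pair) before narrowing back to float. Working in double keeps the 2^m scaling in the normal range even when the float result is tiny. A hypothetical C rendering of that path (sketch only; special cases go to fname_special above):

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch of the expf path above (hypothetical, not the shipped code). */
    float expf_sketch(float xf)
    {
        const double sixtyfour_by_ln2 = 0x1.71547652b82fep+6; /* 64/ln(2) */
        const double ln2_by_64        = 0x1.62e42fefa39efp-7; /* ln(2)/64 */

        double x = xf;                          /* cvtps2pd */
        int n = (int)(x * sixtyfour_by_ln2);
        double r = x - n * ln2_by_64;

        /* degree-3 polynomial is enough at float precision */
        double q = r + r * r * (1 / 2.0 + r * (1 / 6.0));
        double z = exp2((n & 0x3f) / 64.0);     /* table stand-in */
        z += z * q;                             /* f * (1 + q) */

        uint64_t bits;                          /* scale by 2^m via the exponent */
        memcpy(&bits, &z, sizeof bits);
        bits += (uint64_t)(n >> 6) << 52;       /* like shl $52; paddq */
        memcpy(&z, &bits, sizeof bits);
        return (float)z;                        /* cvtpd2ps */
    }
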
diff --git a/src/gas/expm1.S b/src/gas/expm1.S new file mode 100644 index 0000000..dff043c --- /dev/null +++ b/src/gas/expm1.S
@@ -0,0 +1,359 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(expm1) + +#ifdef __ELF__ + .section .note.GNU-stack,"",@progbits +#endif + + .text + .p2align 4 +.globl fname + .type fname, @function + +fname: + + ucomisd .L__max_expm1_arg(%rip),%xmm0 #check if(x > 709.8) + ja .L__Max_Arg + jp .L__Max_Arg + ucomisd .L__min_expm1_arg(%rip),%xmm0 #if(x < -37.42994775023704) + jb .L__Min_Arg + ucomisd .L__log_OneMinus_OneByFour(%rip),%xmm0 + jbe .L__Normal_Flow + ucomisd .L__log_OnePlus_OneByFour(%rip),%xmm0 + jb .L__Small_Arg + + .p2align 4 +.L__Normal_Flow: + movapd %xmm0,%xmm1 #xmm1 = x + mulsd .L__thirtyTwo_by_ln2(%rip),%xmm1 #xmm1 = x*thirtyTwo_by_ln2 + ucomisd .L__zero(%rip),%xmm1 #check if temp < 0.0 + jae .L__Add_Point_Five + subsd .L__point_Five(%rip),%xmm1 + jmp .L__next +.L__Add_Point_Five: + addsd .L__point_Five(%rip),%xmm1 #xmm1 = temp +/- 0.5 +.L__next: + cvttpd2dq %xmm1,%xmm2 #xmm2 = (int)n + cvtdq2pd %xmm2,%xmm1 #xmm1 = (double)n + movapd %xmm2,%xmm3 #xmm3 = (int)n + psrad $5,%xmm2 #xmm2 = m + pslld $27,%xmm3 + psrld $27,%xmm3 #xmm3 = j + movd %xmm3,%edx #edx = j + movd %xmm2,%ecx #ecx = m + + movlhps %xmm1,%xmm1 #xmm1 = n,n + mulpd .L__Ln2By32_MinusTrailLead(%rip),%xmm1 + movapd %xmm0,%xmm2 + subsd %xmm1,%xmm2 #xmm2 = r1 + psrldq $8,%xmm1 #xmm1 = r2 + movapd %xmm2,%xmm3 #xmm3 = r1 + addsd %xmm1,%xmm3 #xmm3 = r + #q = r*(r*(A1.f64 + r*(A2.f64 + r*(A3.f64 + r*(A4.f64 + r*(A5.f64)))))); + movapd %xmm3,%xmm4 + mulsd .L__A5(%rip),%xmm4 + addsd .L__A4(%rip),%xmm4 + mulsd %xmm3,%xmm4 + addsd .L__A3(%rip),%xmm4 + mulsd %xmm3,%xmm4 + addsd .L__A2(%rip),%xmm4 + mulsd %xmm3,%xmm4 + addsd .L__A1(%rip),%xmm4 + mulsd %xmm3,%xmm4 + mulsd %xmm4,%xmm3 #xmm3 = q + + shl $4,%edx + lea S_lead_and_trail_table(%rip),%rax + movdqa (%rax,%rdx,1),%xmm5 #xmm5 = S_T,S_L + + #p = (r2+q) + r1; + addsd %xmm3,%xmm1 + addsd %xmm1,%xmm2 #xmm2 = p + + #s = S_L.f64 + S_T.f64; + movhlps %xmm5,%xmm4 #xmm4 = S_T + movapd %xmm4,%xmm3 #xmm3 = S_T + addsd %xmm5,%xmm3 #xmm3 = s + + cmp $52,%ecx #check m > 52 + jg .L__M_Above_52 + cmp $-7,%ecx #check if m < -7 + jl .L__M_Below_Minus7 + #(-8 < m) && (m < 53) + movapd %xmm2,%xmm3 #xmm3 = p + addsd .L__One(%rip),%xmm3 #xmm3 = 1+p + mulsd %xmm4,%xmm3 #xmm3 = S_T.f64 *(1+p) + mulsd %xmm5,%xmm2 #xmm2 = S_L*p + addsd %xmm3,%xmm2 #xmm2 = (S_L.f64*p+ S_T.f64 *(1+p)) + mov $1023,%edx + sub %ecx,%edx #edx = twopmm + shl $52,%rdx + movd %rdx,%xmm1 #xmm1 = twopmm + subsd %xmm1,%xmm5 #xmm5 = S_L.f64 - twopmm.f64 + addsd %xmm5,%xmm2 + shl $52,%rcx + movd %rcx,%xmm0 #xmm0 = twopm + paddq %xmm2,%xmm0 #xmm0 = twopm *(xmm2) + ret + + .p2align 4 +.L__M_Above_52: + cmp $1024,%ecx #check if m = 1024 + je .L__M_Equals_1024 + #twopm.f64 * (S_L.f64 + (s*p+(S_T.f64 - twopmm.f64)));// 2^-m should not be calculated if 
m>105 + mov $1023,%edx + sub %ecx,%edx #edx = twopmm + shl $52,%rdx + movd %rdx,%xmm1 #xmm1 = twopmm + subsd %xmm1,%xmm4 #xmm4 = S_T - twopmm + mulsd %xmm3,%xmm2 #xmm2 = s*p + addsd %xmm4,%xmm2 + addsd %xmm5,%xmm2 + shl $52,%rcx + movd %rcx,%xmm0 #xmm0 = twopm + paddq %xmm2,%xmm0 + ret + + .p2align 4 +.L__M_Below_Minus7: + #twopm.f64 * (S_L.f64 + (s*p + S_T.f64)) - 1; + mulsd %xmm3,%xmm2 #xmm2 = s*p + addsd %xmm4,%xmm2 #xmm2 = (s*p + S_T.f64) + addsd %xmm5,%xmm2 #xmm2 = (S_L.f64 + (s*p + S_T.f64)) + shl $52,%rcx + movd %rcx,%xmm0 #xmm0 = twopm + paddq %xmm2,%xmm0 #xmm0 = twopm *(xmm2) + subsd .L__One(%rip),%xmm0 + ret + + .p2align 4 +.L__M_Equals_1024: + mov $0x4000000000000000,%rax #1024 at exponent + mulsd %xmm3,%xmm2 #xmm2 = s*p + addsd %xmm4,%xmm2 #xmm2 = (s*p) + S_T + addsd %xmm5,%xmm2 #xmm2 = S_L + ((s*p) + S_T) + movd %rax,%xmm1 #xmm1 = twopm + paddq %xmm2,%xmm1 + movd %xmm1,%rax + mov $0x7FF0000000000000,%rcx + and %rcx,%rax + cmp %rcx,%rax #check if we reached inf + je .L__return_Inf + movapd %xmm1,%xmm0 + ret + + .p2align 4 +.L__Small_Arg: + movapd %xmm0,%xmm1 + psllq $1,%xmm1 + psrlq $1,%xmm1 #xmm1 = abs(x) + ucomisd .L__Five_Pont_FiveEMinus17(%rip),%xmm1 + jb .L__VeryTinyArg + mov $0x01E0000000000000,%rax #30 in exponents place + #u = (twop30.f64 * x + x) - twop30.f64 * x; + movd %rax,%xmm1 + paddq %xmm0,%xmm1 #xmm1 = twop30.f64 * x + movapd %xmm1,%xmm2 + addsd %xmm0,%xmm2 #xmm2 = (twop30.f64 * x + x) + subsd %xmm1,%xmm2 #xmm2 = u + movapd %xmm0,%xmm1 + subsd %xmm2,%xmm1 #xmm1 = v = x-u + movapd %xmm2,%xmm3 #xmm3 = u + mulsd %xmm2,%xmm3 #xmm3 = u*u + mulsd .L__point_Five(%rip),%xmm3 #xmm3 = y = u*u*0.5 + #z = v * (x + u) * 0.5; + movapd %xmm0,%xmm4 + addsd %xmm2,%xmm4 + mulsd %xmm1,%xmm4 + mulsd .L__point_Five(%rip),%xmm4 #xmm4 = z + + #q = x*x*x*(A1.f64 + x*(A2.f64 + x*(A3.f64 + x*(A4.f64 + x*(A5.f64 + x*(A6.f64 + x*(A7.f64 + x*(A8.f64 + x*(A9.f64))))))))); + movapd %xmm0,%xmm5 + mulsd .L__B9(%rip),%xmm5 + addsd .L__B8(%rip),%xmm5 + mulsd %xmm0,%xmm5 + addsd .L__B7(%rip),%xmm5 + mulsd %xmm0,%xmm5 + addsd .L__B6(%rip),%xmm5 + mulsd %xmm0,%xmm5 + addsd .L__B5(%rip),%xmm5 + mulsd %xmm0,%xmm5 + addsd .L__B4(%rip),%xmm5 + mulsd %xmm0,%xmm5 + addsd .L__B3(%rip),%xmm5 + mulsd %xmm0,%xmm5 + addsd .L__B2(%rip),%xmm5 + mulsd %xmm0,%xmm5 + addsd .L__B1(%rip),%xmm5 + mulsd %xmm0,%xmm5 + mulsd %xmm0,%xmm5 + mulsd %xmm0,%xmm5 #xmm5 = q + + ucomisd .L__TwopM7(%rip),%xmm3 + jb .L__returnNext + addsd %xmm4,%xmm1 #xmm1 = v+z + addsd %xmm5,%xmm1 #xmm1 = q+(v+z) + addsd %xmm3,%xmm2 #xmm2 = u+y + addsd %xmm2,%xmm1 + movapd %xmm1,%xmm0 + ret + .p2align 4 +.L__returnNext: + addsd %xmm5,%xmm4 #xmm4 = q +z + addsd %xmm4,%xmm3 #xmm3 = y+(q+z) + addsd %xmm3,%xmm0 + ret + + .p2align 4 +.L__VeryTinyArg: + #(twop100.f64 * x + xabs.f64) * twopm100.f64); + mov $0x0640000000000000,%rax #100 at exponent's place + movd %rax,%xmm2 + paddq %xmm2,%xmm0 + addsd %xmm1,%xmm0 + psubq %xmm2,%xmm0 + ret + + + .p2align 4 +.L__Max_Arg: + movd %xmm0,%rcx + mov $0x7ff0000000000000,%rax + cmp %rax,%rcx #x is either Nan or Inf + jb .L__return_Inf + mov $0x000fffffffffffff,%rdx #check if x is Nan + and %rdx,%rcx + jne .L__Nan +.L__return_Inf: + movd %rax,%xmm0 + #call error_handler + ret + .p2align 4 +.L__Nan: + addsd %xmm0,%xmm0 + ret + ret + + .p2align 4 +.L__Min_Arg: + mov $0xBFF0000000000000,%rax #return -1 + #call error handler + movd %rax,%xmm0 + ret + +.data +.align 16 +.L__max_expm1_arg: + .quad 0x40862E6666666666 +.L__min_expm1_arg: + .quad 0xC042B708872320E1 +.L__log_OneMinus_OneByFour: + .quad 0xBFD269621134DB93 
+.L__log_OnePlus_OneByFour: + .quad 0x3FCC8FF7C79A9A22 +.L__thirtyTwo_by_ln2: + .quad 0x40471547652B82FE +.L__zero: + .quad 0x0000000000000000 +.L__point_Five: + .quad 0x3FE0000000000000 + +.align 16 +.L__Ln2By32_MinusTrailLead: + .octa 0xBD8473DE6AF278ED3F962E42FEF00000 +.L__A5: + .quad 0x3F56C1728D739765 +.L__A4: + .quad 0x3F811115B7AA905E +.L__A3: + .quad 0x3FA5555555545D4E +.L__A2: + .quad 0x3FC5555555548F7C +.L__A1: + .quad 0x3FE0000000000000 +.L__One: + .quad 0x3FF0000000000000 + +.align 16 +# .type two_to_jby32_table, @object +# .size two_to_jby32_table, 512 +S_lead_and_trail_table: + .octa 0x00000000000000003FF0000000000000 + .octa 0x3D0A1D73E2A475B43FF059B0D3158540 + .octa 0x3CEEC5317256E3083FF0B5586CF98900 + .octa 0x3CF0A4EBBF1AED933FF11301D0125B40 + .octa 0x3D0D6E6FBE4628763FF172B83C7D5140 + .octa 0x3D053C02DC0144C83FF1D4873168B980 + .octa 0x3D0C3360FD6D8E0B3FF2387A6E756200 + .octa 0x3D009612E8AFAD123FF29E9DF51FDEC0 + .octa 0x3CF52DE8D5A463063FF306FE0A31B700 + .octa 0x3CE54E28AA05E8A93FF371A7373AA9C0 + .octa 0x3D011ADA0911F09F3FF3DEA64C123400 + .octa 0x3D068189B7A04EF83FF44E0860618900 + .octa 0x3D038EA1CBD7F6213FF4BFDAD5362A00 + .octa 0x3CBDF0A83C49D86A3FF5342B569D4F80 + .octa 0x3D04AC64980A8C8F3FF5AB07DD485400 + .octa 0x3CD2C7C3E81BF4B73FF6247EB03A5580 + .octa 0x3CE921165F626CDD3FF6A09E667F3BC0 + .octa 0x3D09EE91B87977853FF71F75E8EC5F40 + .octa 0x3CDB5F54408FDB373FF7A11473EB0180 + .octa 0x3CF28ACF88AFAB353FF82589994CCE00 + .octa 0x3CFB5BA7C55A192D3FF8ACE5422AA0C0 + .octa 0x3D027A280E1F92A03FF93737B0CDC5C0 + .octa 0x3CF01C7C46B071F33FF9C49182A3F080 + .octa 0x3CFC8B424491CAF83FFA5503B23E2540 + .octa 0x3D06AF439A68BB993FFAE89F995AD380 + .octa 0x3CDBAA9EC206AD4F3FFB7F76F2FB5E40 + .octa 0x3CFC2220CB12A0923FFC199BDD855280 + .octa 0x3D048A81E5E8F4A53FFCB720DCEF9040 + .octa 0x3CDC976816BAD9B83FFD5818DCFBA480 + .octa 0x3CFEB968CAC39ED33FFDFC97337B9B40 + .octa 0x3CF9858F73A18F5E3FFEA4AFA2A490C0 + .octa 0x3C99D3E12DD8A18B3FFF50765B6E4540 + +.align 16 +.L__Five_Pont_FiveEMinus17: + .quad 0x3C90000000000000 +.L__B9: + .quad 0x3E5A2836AA646B96 +.L__B8: + .quad 0x3E928295484734EA +.L__B7: + .quad 0x3EC71E14BFE3DB59 +.L__B6: + .quad 0x3EFA019F635825C4 +.L__B5: + .quad 0x3F2A01A01159DD2D +.L__B4: + .quad 0x3F56C16C16CE14C6 +.L__B3: + .quad 0x3F8111111111A9F3 +.L__B2: + .quad 0x3FA55555555554B6 +.L__B1: + .quad 0x3FC5555555555549 +.L__TwopM7: + .quad 0x3F80000000000000
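
expm1 above is structured differently from the exp family: a 32-entry table indexed by j = n & 0x1f, each entry split into S_lead/S_trail, and three reconstruction ranges keyed on m, plus a separate small-|x| polynomial path. A hypothetical C sketch of the mid-range reconstruction only (-8 < m < 53), using the identity expm1(x) = 2^m * (s*(1+p) - 2^-m):

    #include <math.h>

    /* Sketch of expm1's main path above (hypothetical, not the shipped code).
     * The assembly keeps r = r1 + r2 and s = S_lead + S_trail split for
     * accuracy, and handles m > 52, m < -7, m == 1024, and small |x| apart. */
    double expm1_sketch(double x)
    {
        const double thirtytwo_by_ln2 = 0x1.71547652b82fep+5; /* 32/ln(2) */
        const double ln2_by_32        = 0x1.62e42fefa39efp-6; /* ln(2)/32 */

        double t = x * thirtytwo_by_ln2;
        int n = (int)(t + (t >= 0.0 ? 0.5 : -0.5)); /* round half away from zero */
        int j = n & 0x1f, m = n >> 5;
        double r = x - n * ln2_by_32;

        /* q = e^r - 1 - r, via the A1..A5 coefficients tabulated above */
        double q = r * r * (0.5 + r * (1 / 6.0 + r * (1 / 24.0
                   + r * (1 / 120.0 + r * (1 / 720.0)))));
        double p = r + q;                  /* p ~= e^r - 1 */
        double s = exp2(j / 32.0);         /* S_lead + S_trail table stand-in */

        /* 2^m * (s*(1+p) - 2^-m) = e^x - 1 */
        return ldexp((s - ldexp(1.0, -m)) + s * p, m);
    }
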
diff --git a/src/gas/expm1f.S b/src/gas/expm1f.S new file mode 100644 index 0000000..6e7ca03 --- /dev/null +++ b/src/gas/expm1f.S
@@ -0,0 +1,323 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(expm1f) +#define fname_special _expm1f_special@PLT + +#ifdef __ELF__ + .section .note.GNU-stack,"",@progbits +#endif + + .text + .p2align 4 +.globl fname + .type fname, @function + +fname: + ucomiss .L__max_expm1_arg(%rip),%xmm0 ##if(x > max_expm1_arg) + ja .L__Max_Arg + jp .L__Max_Arg + ucomiss .L__log_OnePlus_OneByFour(%rip),%xmm0 ##if(x < log_OnePlus_OneByFour) + jae .L__Normal_Flow + ucomiss .L__log_OneMinus_OneByFour(%rip),%xmm0 ##if(x > log_OneMinus_OneByFour) + ja .L__Small_Arg + ucomiss .L__min_expm1_arg(%rip),%xmm0 ##if(x < min_expm1_arg) + jb .L__Min_Arg + + .p2align 4 +.L__Normal_Flow: + movaps %xmm0,%xmm1 #xmm1 = x + mulss .L__thirtyTwo_by_ln2(%rip),%xmm1 #xmm1 = x*thirtyTwo_by_ln2 + movd %xmm1,%eax #eax = x*thirtyTwo_by_ln2 + and $0x80000000,%eax #get the sign of x*thirtyTwo_by_ln2 + or $0x3F000000,%eax #make +/- 0.5 + movd %eax,%xmm2 #xmm2 = +/- 0.5 + addss %xmm2,%xmm1 #xmm1 = (x*32/ln2) +/- 0.5 + cvttps2dq %xmm1,%xmm2 #xmm2 = n = (int)(temp) + mov $0x0000001f,%edx + movd %edx,%xmm1 + andps %xmm2,%xmm1 #xmm1 = j + movd %xmm2,%ecx #ecx = n + sarl $5, %ecx #ecx = m = n >> 5 + #xor %rdx,%rdx #make it zeros, to be used for address + movd %xmm1,%edx #edx = j + lea S_lead_and_trail_table(%rip),%rax + movsd (%rax,%rdx,8),%xmm3 #xmm3 = S_T,S_L + punpckldq %xmm2,%xmm1 #xmm1 = n,j + psubd %xmm1,%xmm2 #xmm2 = n1 + punpcklqdq %xmm2,%xmm1 #xmm1 = n1,n,j + cvtdq2ps %xmm1,%xmm1 #xmm1 = (float)(n1,n,j) + + #r2 = -(n*ln2_by_ThirtyTwo_trail); + #r1 = (x-n1*ln2_by_ThirtyTwo_lead) - j*ln2_by_ThirtyTwo_lead; + mulps .L__Ln2By32_LeadTrailLead(%rip),%xmm1 + movhlps %xmm1,%xmm2 #xmm2 = n1*ln2/32lead + movaps %xmm0,%xmm4 #xmm4 = x + subss %xmm2,%xmm4 #xmm4 = x - n1*ln2/32lead + subss %xmm1,%xmm4 #xmm4 = r1 + psrldq $4,%xmm1 #xmm1 = -r2 should take care of sign later + + #r = r1 + r2; + movaps %xmm4,%xmm7 #xmm7 = r1 + subss %xmm1,%xmm4 #xmm4 = r = r1-(-r2) = r1 + r2 + + #q = r*r*(B1+r*(B2)); + movaps %xmm4,%xmm6 #xmm6 = r + mulss .L__B2_f(%rip),%xmm6 #xmm6 = r * B2 + addss .L__B1_f(%rip),%xmm6 #xmm6 = B1 + (r * B2) + mulss %xmm4,%xmm6 + mulss %xmm4,%xmm6 #xmm6 = q + + #p = (r2+q) + r1; + subss %xmm1,%xmm6 + addss %xmm7,%xmm6 #xmm6 = p + + #s = S_L.f32 + S_T.f32; + movdqa %xmm3,%xmm2 #xmm2 = S_T,S_L + psrldq $4,%xmm2 #xmm2 = S_T + movaps %xmm2,%xmm5 #xmm5 = S_T + addss %xmm3,%xmm2 #xmm2 = s + + cmp $0xfffffff9,%ecx #Check m < -7 + jl .L__M_Below_Minus7 + cmp $23,%ecx #Check m > 23 + jg .L__M_Above_23 + # -8 < m < 24 + #twopm.f32 * ((S_L.f32 - twopmm.f32) + (S_L.f32*p+ S_T.f32 *(1+p))); + movaps %xmm3,%xmm2 #xmm2 = S_L + mulss %xmm6,%xmm2 #xmm2 = S_L * p + addss .L__One_f(%rip),%xmm6 #xmm6 = 1+p + mulss %xmm5,%xmm6 #xmm6 = S_T *(1+p) + addss %xmm6,%xmm2 #xmm2 = 
(S_L.f32*p+ S_T.f32 *(1+p)) + mov $127,%eax + sub %ecx,%eax #eax = 127 - m + shl $23,%eax #eax = 2^-m + movd %eax,%xmm1 + subss %xmm1,%xmm3 #xmm3 = (S_L.f32 - twopmm.f32) + addss %xmm3,%xmm2 #xmm2 = ((S_L.f32 - twopmm.f32) + (S_L.f32*p+ S_T.f32 *(1+p))) + shl $23,%ecx + movd %ecx,%xmm0 + paddd %xmm2,%xmm0 + ret + + .p2align 4 +.L__M_Below_Minus7: + #twopm.f32 * (S_L.f32 + (s*p + S_T.f32)) - 1; + mulss %xmm6,%xmm2 #xmm2 = s*p + addss %xmm5,%xmm2 #xmm2 = s*p + S_T + addss %xmm3,%xmm2 #xmm2 = (S_L.f32 + (s*p + S_T.f32)) + shl $23,%ecx + movd %ecx,%xmm0 + paddd %xmm2,%xmm0 + subss .L__One_f(%rip),%xmm0 + ret + + .p2align 4 +.L__M_Above_23: + #twopm.f32 * (S_L.f32 + (s*p+(S_T.f32 - twopmm.f32))); + cmp $0x00000080,%ecx #Check m < 128 + je .L__M_Equals_128 + cmp $47,%ecx #Check m > 47 + ja .L__M_Above_47 + mov $127,%eax + sub %ecx,%eax #eax = 127 - m + shl $23,%eax #eax = 2^-m + movd %eax,%xmm1 + subss %xmm1,%xmm5 #xmm5 = S_T.f32 - twopmm.f32 + + .p2align 4 +.L__M_Above_47: + shl $23,%ecx + mulss %xmm6,%xmm2 #xmm2 = s*p + addss %xmm5,%xmm2 + addss %xmm3,%xmm2 + movd %ecx,%xmm0 + paddd %xmm2,%xmm0 + ret + + .p2align 4 +.L__M_Equals_128: + mov $0x3f800000,%ecx #127 at exponent + mulss %xmm6,%xmm2 #xmm2 = s*p + addss %xmm5,%xmm2 #xmm2 = s*p + S_T + addss %xmm3,%xmm2 #xmm2 = (S_L.f32 + (s*p + S_T.f32)) + movd %ecx,%xmm1 #127 + paddd %xmm2,%xmm1 #2^127*(S_L.f32 + (s*p + S_T.f32)) + mov $0x00800000,%ecx #multiply with one more 2 + movd %ecx,%xmm2 + paddd %xmm2,%xmm1 + movd %xmm1,%ecx + and $0x7f800000,%ecx #check if we reached +inf + cmp $0x7f800000,%ecx + je .L__Overflow + movdqa %xmm1,%xmm0 + ret + + .p2align 4 +.L__Small_Arg: + movd %xmm0,%eax + and $0x7fffffff,%eax #eax = abs(x) + cmp $0x33000000,%eax #check abs(x) < 2^-25 + jl .L__VeryTiny_Arg + #log(1-1/4) < x < log(1+1/4) + #q = x*x*x*(A1 + x*(A2 + x*(A3 + x*(A4 + x*(A5))))); + movdqa %xmm0,%xmm1 + mulss .L__A5_f(%rip),%xmm1 + addss .L__A4_f(%rip),%xmm1 + mulss %xmm0,%xmm1 + addss .L__A3_f(%rip),%xmm1 + mulss %xmm0,%xmm1 + addss .L__A2_f(%rip),%xmm1 + mulss %xmm0,%xmm1 + addss .L__A1_f(%rip),%xmm1 + mulss %xmm0,%xmm1 + mulss %xmm0,%xmm1 + mulss %xmm0,%xmm1 + cvtps2pd %xmm0,%xmm2 + movdqa %xmm2,%xmm0 + mulsd %xmm0,%xmm2 + mulsd .L__PointFive(%rip),%xmm2 + addsd %xmm2,%xmm0 + cvtps2pd %xmm1,%xmm2 + addsd %xmm0,%xmm2 + cvtpd2ps %xmm2,%xmm0 + ret + + .p2align 4 +.L__Min_Arg: + mov $0xBF800000,%eax + #call handle_error + movd %eax,%xmm0 + ret + + .p2align 4 +.L__Max_Arg: + movd %xmm0,%eax + and $0x7fffffff,%eax #eax = abs(x) + cmp $0x7f800000,%eax #check for Nan + jae .L__Nan +.L__Overflow: + mov $0x7f800000,%eax + #call handle_error + movd %eax,%xmm0 + ret +.L__Nan: + and $0x007fffff,%eax + je .L__Overflow + addss %xmm0,%xmm0 + ret + + .p2align 4 +.L__VeryTiny_Arg: + #((twopm.f32 * x + xabs.f32) * twopmm.f32); + movd %eax, %xmm1 #xmm1 = abs(x) + mov $0x32000000, %eax #100 at exponent's place + movd %eax, %xmm2 + paddd %xmm2, %xmm0 + addss %xmm1, %xmm0 + psubd %xmm2, %xmm0 + ret + +.data +.align 16 +.type S_lead_and_trail_table, @object +.size S_lead_and_trail_table, 256 +S_lead_and_trail_table: + .quad 0x000000003F800000 + .quad 0x355315853F82CD80 + .quad 0x34D9F3123F85AAC0 + .quad 0x35E8092E3F889800 + .quad 0x3471F5463F8B95C0 + .quad 0x36E62D173F8EA400 + .quad 0x361B9D593F91C3C0 + .quad 0x36BEA3FC3F94F4C0 + .quad 0x36C146373F9837C0 + .quad 0x36E6E7553F9B8D00 + .quad 0x36C982473F9EF500 + .quad 0x34C0C3123FA27040 + .quad 0x36354D8B3FA5FEC0 + .quad 0x3655A7543FA9A140 + .quad 0x36FBA90B3FAD5800 + .quad 0x36D6074B3FB123C0 + .quad 0x36CCCFE73FB504C0 + 
.quad 0x36BD1D8C3FB8FB80 + .quad 0x368E7D603FBD0880 + .quad 0x35CCA6673FC12C40 + .quad 0x36A845543FC56700 + .quad 0x36F619B93FC9B980 + .quad 0x35C151F83FCE2480 + .quad 0x366C8F893FD2A800 + .quad 0x36F32B5A3FD744C0 + .quad 0x36DE5F6C3FDBFB80 + .quad 0x367761553FE0Ccc0 + .quad 0x355CEF903FE5B900 + .quad 0x355CFBA53FEAC0c0 + .quad 0x36E66F733FEFE480 + .quad 0x36F454923FF52540 + .quad 0x36CB6DC93FFA8380 + +.align 16 +.L__Ln2By32_LeadTrailLead: + .octa 0x333FBE8E3CB17200333FBE8E3CB17200 + +.L__max_expm1_arg: + .long 0x42B19999 +.L__log_OnePlus_OneByFour: + .long 0x3E647FBF + +.L__log_OneMinus_OneByFour: + .long 0xBE934B11 + +.L__min_expm1_arg: + .long 0xC18AA122 + +.L__thirtyTwo_by_ln2: + .long 0x4238AA3B + +.align 16 +.L__B2_f: + .long 0x3E2AAAEC +.L__B1_f: + .long 0x3F000044 +.L__One_f: + .long 0x3F800000 +.L__PointFive: + .quad 0x3FE0000000000000 + +.align 16 +.L__A1_f: + .long 0x3E2AAAAA +.L__A2_f: + .long 0x3D2AAAA0 +.L__A3_f: + .long 0x3C0889FF +.L__A4_f: + .long 0x3AB64DE5 +.L__A5_f: + .long 0x394AB327 + + + + +
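
A small idiom at the top of expm1f above deserves a note: the +/-0.5 used to round n is synthesized with integer ops on the float's bits (AND with 0x80000000 keeps the sign, OR with 0x3f000000 supplies 0.5), then added and truncated. A hypothetical C equivalent:

    #include <math.h>

    /* The and/or/addss/cvttps2dq sequence in expm1f, i.e. round half away
     * from zero (hypothetical helper, not the shipped code). */
    static int round_half_away(float t)
    {
        return (int)(t + copysignf(0.5f, t));
    }
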
diff --git a/src/gas/fabs.S b/src/gas/fabs.S new file mode 100644 index 0000000..a436d0f --- /dev/null +++ b/src/gas/fabs.S
@@ -0,0 +1,63 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# fabs.S +# +# An implementation of the fabs libm function. +# +# Prototype: +# +# double fabs(double x); +# + +# +# Algorithm: AND the Most Significant Bit of the +# double precision number with 0 to get the +# floating point absolute. +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(fabs) +#define fname_special _fabs_special + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + #input is in xmm0, which contains the final result also. + andpd .L__fabs_and_mask(%rip), %xmm0 # <result> latency = 3 + ret + + +.align 16 +.L__fabs_and_mask: .quad 0x7FFFFFFFFFFFFFFF + .quad 0x0 + +
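
The andpd above is the whole function: clearing bit 63 of the IEEE-754 encoding is the absolute value (it also quietly strips the sign of -0.0 and of negative NaNs). A hypothetical C equivalent of the same bit operation:

    #include <stdint.h>
    #include <string.h>

    /* Clear the sign bit (bit 63) of the double's encoding -- the andpd above. */
    double fabs_sketch(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        bits &= 0x7fffffffffffffffULL;   /* same mask as .L__fabs_and_mask */
        memcpy(&x, &bits, sizeof bits);
        return x;
    }
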
diff --git a/src/gas/fabsf.S b/src/gas/fabsf.S new file mode 100644 index 0000000..8a6ea27 --- /dev/null +++ b/src/gas/fabsf.S
@@ -0,0 +1,67 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# fabsf.S +# +# An implementation of the fabsf libm function. +# +# Prototype: +# +# float fabsf(float x); +# + +# +# Algorithm: AND the Most Significant Bit of the +# single precision number with 0 to get the +# floating point absolute. +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(fabsf) +#define fname_special _fabsf_special + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + #input is in xmm0, which contains the final result also. + andps .L__fabsf_and_mask(%rip), %xmm0 # <result> latency = 3 + ret + + +.align 16 +.L__fabsf_and_mask: .long 0x7FFFFFFF + .long 0x0 + .quad 0x0 + + + + +
diff --git a/src/gas/fdim.S b/src/gas/fdim.S new file mode 100644 index 0000000..14e382f --- /dev/null +++ b/src/gas/fdim.S
@@ -0,0 +1,63 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +#fdim.S +# +# An implementation of the fdim libm function. +# +# The fdim functions determine the positive difference between their arguments +# +# x - y if x > y +# +0 if x <= y +# +# +# +# Prototype: +# +# double fdim(double x, double y) +# + +# +# Algorithm: +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(fdim) + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + MOVAPD %xmm0,%xmm2 + SUBSD %xmm1,%xmm0 + CMPNLESD %xmm1,%xmm2 + ANDPD %xmm2,%xmm0 + + ret
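The body above is branchless: CMPNLESD writes an all-ones mask into xmm2 exactly when x is not less-than-or-equal to y (that is, x > y, or the operands are unordered), and ANDPD then keeps either the difference or +0. A C sketch of the masking idea (illustrative, and glossing over the unordered/NaN case):

    #include <stdint.h>
    #include <string.h>

    /* Branchless fdim: keep (x - y) only when x > y, else +0.0,
       mirroring SUBSD + CMPNLESD + ANDPD. NB: CMPNLESD also sets
       the mask for unordered (NaN) operands; that detail is
       omitted here. */
    static double fdim_masked(double x, double y) {
        double diff = x - y;
        uint64_t mask = (x > y) ? ~0ULL : 0ULL;  /* CMPNLESD */
        uint64_t bits;
        memcpy(&bits, &diff, sizeof bits);
        bits &= mask;                            /* ANDPD */
        memcpy(&diff, &bits, sizeof bits);
        return diff;
    }

fdimf.S below is identical in structure, using the single-precision forms SUBSS/CMPNLESS/ANDPS.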
diff --git a/src/gas/fdimf.S b/src/gas/fdimf.S new file mode 100644 index 0000000..0b7a966 --- /dev/null +++ b/src/gas/fdimf.S
@@ -0,0 +1,61 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +#fdimf.S +# +# An implementation of the fdimf libm function. +# +# The fdim functions determine the positive difference between their arguments +# +# x - y if x > y +# +0 if x <= y +# +# Prototype: +# +# float fdimf(float x, float y) +# + +# +# Algorithm: +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(fdimf) + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + MOVAPD %xmm0,%xmm2 + SUBSS %xmm1,%xmm0 + CMPNLESS %xmm1,%xmm2 + ANDPS %xmm2,%xmm0 + + ret
diff --git a/src/gas/fmax.S b/src/gas/fmax.S new file mode 100644 index 0000000..ec0d787 --- /dev/null +++ b/src/gas/fmax.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmax.S
+#
+# An implementation of the fmax libm function.
+#
+# The fmax functions determine the maximum numeric value of their arguments.
+#
+# Prototype:
+#
+#     double fmax(double x, double y)
+#
+
+#
+#   Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmax)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+    MOVAPD %xmm0,%xmm3
+
+    MAXSD %xmm1,%xmm0
+    MOVAPD %xmm0,%xmm2
+
+    #If the input is NaN then special-case to return the other operand
+    CMPEQSD %xmm2,%xmm2
+    PAND %xmm2,%xmm0
+
+    PANDN %xmm3,%xmm2
+    POR %xmm2,%xmm0
+
+    ret
+
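MAXSD returns its second source operand when the inputs are unordered, so max(x, NaN) would yield NaN even though C99 fmax is required to return x. The CMPEQSD/PAND/PANDN/POR sequence above repairs this: a self-comparison of the MAXSD result produces an all-zero mask only for NaN, and the blend then substitutes the saved x. fmaxf, fmin, and fminf below apply the identical blend around MAXSS, MINSD, and MINSS. A hedged C sketch of the selection logic (names illustrative):

    /* fmax with the NaN blend used above: m is what MAXSD computes
       (y when unordered or equal), and the final select mirrors
       CMPEQSD + PAND/PANDN/POR. */
    static double fmax_blend(double x, double y) {
        double m = (x > y) ? x : y;   /* MAXSD: yields y on NaN/equal */
        return (m == m) ? m : x;      /* a NaN fails self-comparison  */
    }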
diff --git a/src/gas/fmaxf.S b/src/gas/fmaxf.S new file mode 100644 index 0000000..828832f --- /dev/null +++ b/src/gas/fmaxf.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmaxf.S
+#
+# An implementation of the fmaxf libm function.
+#
+# The fmax functions determine the maximum numeric value of their arguments.
+#
+# Prototype:
+#
+#     float fmaxf(float x, float y)
+#
+
+#
+#   Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmaxf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+    MOVAPD %xmm0,%xmm3
+
+    MAXSS %xmm1,%xmm0
+    MOVAPD %xmm0,%xmm2
+
+    #If the input is NaN then special-case to return the other operand
+    CMPEQSS %xmm2,%xmm2
+    PAND %xmm2,%xmm0
+
+    PANDN %xmm3,%xmm2
+    POR %xmm2,%xmm0
+
+    ret
+
diff --git a/src/gas/fmin.S b/src/gas/fmin.S new file mode 100644 index 0000000..79b3fb6 --- /dev/null +++ b/src/gas/fmin.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmin.S
+#
+# An implementation of the fmin libm function.
+#
+# The fmin functions determine the minimum numeric value of their arguments.
+#
+# Prototype:
+#
+#     double fmin(double x, double y)
+#
+
+#
+#   Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmin)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+    MOVAPD %xmm0,%xmm3
+
+    MINSD %xmm1,%xmm0
+    MOVAPD %xmm0,%xmm2
+
+    #If the input is NaN then special-case to return the other operand
+    CMPEQSD %xmm2,%xmm2
+    PAND %xmm2,%xmm0
+
+    PANDN %xmm3,%xmm2
+    POR %xmm2,%xmm0
+
+    ret
+
diff --git a/src/gas/fminf.S b/src/gas/fminf.S new file mode 100644 index 0000000..34ee357 --- /dev/null +++ b/src/gas/fminf.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fminf.S
+#
+# An implementation of the fminf libm function.
+#
+# The fmin functions determine the minimum numeric value of their arguments.
+#
+#
+# Prototype:
+#
+#     float fminf(float x, float y)
+#
+
+#
+#   Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fminf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+    MOVAPD %xmm0,%xmm3
+
+    MINSS %xmm1,%xmm0
+    MOVAPD %xmm0,%xmm2
+
+    #If the input is NaN then special-case to return the other operand
+    CMPEQSS %xmm2,%xmm2
+    PAND %xmm2,%xmm0
+
+    PANDN %xmm3,%xmm2
+    POR %xmm2,%xmm0
+
+    ret
diff --git a/src/gas/fmod.S b/src/gas/fmod.S new file mode 100644 index 0000000..bc1eeae --- /dev/null +++ b/src/gas/fmod.S
@@ -0,0 +1,223 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fmod.S
+#
+# An implementation of the fmod libm function.
+#
+# Prototype:
+#
+#     double fmod(double x,double y);
+#
+
+#
+#   Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmod)
+#define fname_special _fmod_special
+
+
+# local variable storage offsets
+.equ temp_x, 0x0
+.equ temp_y, 0x10
+.equ stack_size, 0x28
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+    mov .L__exp_mask_64(%rip), %r10
+    #move the input to GP registers
+    movd %xmm0,%r8
+    movd %xmm1,%r9
+    movapd %xmm0,%xmm4
+    movapd %xmm1,%xmm5
+    movapd .L__Nan_64(%rip),%xmm6
+    and %r10,%r8
+    and %r10,%r9
+    ror $52, %r8
+    ror $52, %r9
+    #if either of the exponents is zero, we do the fmod calculation in x87 mode
+    test %r8, %r8
+    jz .L__LargeExpDiffComputation
+    mov %r9,%r10
+    test %r9, %r9
+    jz .L__LargeExpDiffComputation
+    sub %r9,%r8
+    cmp $52,%r8
+    jge .L__LargeExpDiffComputation
+    pand %xmm6,%xmm4
+    pand %xmm6,%xmm5
+    comisd %xmm5,%xmm4
+    jp .L__InputIsNaN      # if either of xmm1 or xmm0 is a NaN then
+                           # the parity flag is set
+    jz .L__Input_Is_Equal
+    jbe .L__ReturnImmediate
+    cmp $0x7FF,%r8
+    jz .L__Dividend_Is_Infinity
+
+    #calculation without using the x87 FPU
+.L__DirectComputation:
+    movapd %xmm4,%xmm2
+    movapd %xmm5,%xmm3
+    divsd %xmm3,%xmm2
+    cvttsd2siq %xmm2,%r8
+    cvtsi2sdq %r8,%xmm2
+
+    #multiplication in QUAD Precision
+    #Since the commented-out multiplication below resulted in an error,
+    #we had to implement a quad precision multiplication.
+    #LOGIC behind Quad Precision Multiplication
+    #x = hx + tx, by setting x's last 27 bits to null
+    #y = hy + ty, similar to x
+    movapd .L__27bit_andingmask_64(%rip),%xmm4
+    #movddup %xmm5,%xmm5 #[x,x]
+    #movddup %xmm2,%xmm2 #[y,y]
+
+    movapd %xmm5,%xmm1 # x
+    movapd %xmm2,%xmm6 # y
+    movapd %xmm2,%xmm7 #
+    mulsd %xmm5,%xmm7  # xmm7 = z = x*y
+    andpd %xmm4,%xmm1
+    andpd %xmm4,%xmm2
+    subsd %xmm1,%xmm5  # xmm1 = hx, xmm5 = tx
+    subsd %xmm2,%xmm6  # xmm2 = hy, xmm6 = ty
+
+    movapd %xmm1,%xmm4 # copy hx
+    mulsd %xmm2,%xmm4  # xmm4 = hx*hy
+    subsd %xmm7,%xmm4  # xmm4 = (hx*hy - z)
+    mulsd %xmm6,%xmm1  # xmm1 = hx * ty
+    addsd %xmm1,%xmm4  # xmm4 = ((hx * hy - z) + hx * ty)
+    mulsd %xmm5,%xmm2  # xmm2 = tx * hy
+    addsd %xmm2,%xmm4  # xmm4 = (((hx * hy - z) + hx * ty) + tx * hy)
+    mulsd %xmm5,%xmm6  # xmm6 = tx * ty
+    addsd %xmm4,%xmm6  # xmm6 = (((hx * hy - z) + hx * ty) + tx * hy) + tx * ty
+    #xmm6 and xmm7 contain the quad precision result
+    #v = dx - c;
+    #dx = v + (((dx - v) - c) - cc);
+    movapd %xmm0,%xmm1 # copy the input number
+    pand .L__Nan_64(%rip),%xmm1
+    movapd %xmm1,%xmm2 # xmm2 = dx = xmm1
+    subsd %xmm7,%xmm1  # v = dx - c
+    subsd %xmm1,%xmm2  # (dx - v)
+    subsd %xmm7,%xmm2  # ((dx - v) - c)
+    subsd %xmm6,%xmm2  # (((dx - v) - c) - cc)
+    addsd %xmm1,%xmm2  # xmm2 = dx = v + (((dx - v) - c) - cc)
+    # xmm3 = w
+    comisd .L__Zero_64(%rip),%xmm2
+    jae .L__positive
+    addsd %xmm3,%xmm2
+.L__positive:
+# return x < 0.0? -dx : dx;
+.L__Finish:
+    comisd .L__Zero_64(%rip), %xmm0
+    ja .L__Not_Negative_Number1
+
+.L__Negative_Number1:
+    movapd .L__Zero_64(%rip),%xmm0
+    subsd %xmm2,%xmm0
+    ret
+.L__Not_Negative_Number1:
+    movapd %xmm2,%xmm0
+    ret
+
+    #calculation using the x87 FPU
+    #For numbers where the exponent of either the divisor
+    #or the dividend is 0, or where the exponent
+    #difference is greater than 52
+.align 16
+.L__LargeExpDiffComputation:
+    sub $stack_size, %rsp
+    movsd %xmm0, temp_x(%rsp)
+    movsd %xmm1, temp_y(%rsp)
+    ffree %st(0)
+    ffree %st(1)
+    fldl temp_y(%rsp)
+    fldl temp_x(%rsp)
+    fnclex
+.align 32
+.L__repeat:
+    fprem           #Calculate the remainder by dividing st(0) by st(1);
+                    #fprem sets the x87 condition codes, and sets
+                    #C2 to 1 if only a partial remainder was calculated
+    fnstsw %ax      # store the x87 FPU status word in ax;
+    and $0x0400,%ax # we need to check only the C2 bit of the condition codes
+    cmp $0x0400,%ax # check whether bit 10 (C2) is set;
+                    # if it is set, a partial remainder was calculated
+    jz .L__repeat
+    #store the result from the FPU stack to memory
+    fstpl temp_x(%rsp)
+    fstpl temp_y(%rsp)
+    movsd temp_x(%rsp), %xmm0
+    add $stack_size, %rsp
+    ret
+
+    #If both the inputs are equal
+.L__Input_Is_Equal:
+    cmp $0x7FF,%r8
+    jz .L__Dividend_Is_Infinity
+    cmp $0x7FF,%r9
+    jz .L__InputIsNaN
+    movsd %xmm0,%xmm1
+    pand .L__sign_mask_64(%rip),%xmm1
+    movsd .L__Zero_64(%rip),%xmm0
+    por %xmm1,%xmm0
+    ret
+
+.L__InputIsNaN:
+    por .L__QNaN_mask_64(%rip),%xmm0
+    por .L__exp_mask_64(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+    ret
+
+#Case when x < y
+.L__ReturnImmediate:
+    ret
+
+
+
+.align 32
+.L__sign_mask_64:        .quad 0x8000000000000000
+                         .quad 0x0
+.L__exp_mask_64:         .quad 0x7FF0000000000000
+                         .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+                         .quad 0
+.L__2p52_mask_64:        .quad 0x4330000000000000
+                         .quad 0
+.L__Zero_64:             .quad 0x0
+                         .quad 0
+.L__QNaN_mask_64:        .quad 0x0008000000000000
+                         .quad 0
+.L__Nan_64:              .quad 0x7FFFFFFFFFFFFFFF
+                         .quad 0
+
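The 27-bit masking above is a Dekker-style split: each double is cut into a head with at most 26 significant bits and a tail, so the partial products hx*hy, hx*ty, and tx*hy are exact, and the rounded product c = x*y can be paired with a correction cc such that x*y == c + cc exactly. A hedged C sketch of the same two-product (illustrative names, not this file's API):

    #include <stdint.h>
    #include <string.h>

    /* Split x into head + tail by zeroing the low 27 mantissa bits,
       as the andpd with .L__27bit_andingmask_64 does. */
    static void split27(double x, double *head, double *tail) {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        bits &= 0xfffffffff8000000ULL;
        memcpy(head, &bits, sizeof bits);
        *tail = x - *head;
    }

    /* Exact product: returns c and sets *cc so that x*y == c + cc. */
    static double two_product(double x, double y, double *cc) {
        double c = x * y;
        double hx, tx, hy, ty;
        split27(x, &hx, &tx);
        split27(y, &hy, &ty);
        *cc = ((hx * hy - c) + hx * ty + tx * hy) + tx * ty;
        return c;
    }

When an exponent field is zero (a denormal is involved) or the exponent gap is 52 or more, this trick would lose bits, which is why those cases fall back to the x87 fprem loop at .L__LargeExpDiffComputation, iterating until the C2 status bit reports that the remainder is complete.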
diff --git a/src/gas/fmodf.S b/src/gas/fmodf.S new file mode 100644 index 0000000..c31d619 --- /dev/null +++ b/src/gas/fmodf.S
@@ -0,0 +1,181 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fmodf.S
+#
+# An implementation of the fmodf libm function.
+#
+# Prototype:
+#
+#     float fmodf(float x,float y);
+#
+
+#
+#   Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmodf)
+#define fname_special _fmodf_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+    mov .L__exp_mask_64(%rip), %rdi
+    movapd .L__sign_mask_64(%rip),%xmm6
+    cvtss2sd %xmm0,%xmm2    # double x
+    cvtss2sd %xmm1,%xmm3    # double y
+    pand %xmm6,%xmm2
+    pand %xmm6,%xmm3
+    movd %xmm2,%rax
+    movd %xmm3,%r8
+    mov %rax,%r11
+    mov %r8,%r9
+    movsd %xmm2,%xmm4
+    #take the exponents of both x and y
+    and %rdi,%rax
+    and %rdi,%r8
+    ror $52, %rax
+    ror $52, %r8
+    # if either of the exponents is infinity
+    cmp $0X7FF,%rax
+    jz .L__InputIsNaN
+    cmp $0X7FF,%r8
+    jz .L__InputIsNaNOrInf
+
+    cmp $0,%r8
+    jz .L__Divisor_Is_Zero
+
+    cmp %r9, %r11
+    jz .L__Input_Is_Equal
+    jb .L__ReturnImmediate
+
+    xor %rcx,%rcx
+    mov $24,%rdx
+    movsd .L__One_64(%rip),%xmm7    # xmm7 = scale
+    cmp %rax,%r8
+    jae .L__y_is_greater
+    #xmm3 = dy
+    sub %r8,%rax
+    div %dl         # al = ntimes
+    mov %al,%cl     # cl = ntimes
+    and $0xFF,%ax   # set everything to zero except al
+    mul %dl         # ax = dl * al = 24 * ntimes
+    add $1023, %rax
+    shl $52,%rax
+    movd %rax,%xmm7 # xmm7 = scale
+.L__y_is_greater:
+    mulsd %xmm3,%xmm7       # xmm7 = scale * dy
+    movsd .L__2pminus24_decimal(%rip),%xmm6
+
+.align 16
+.L__Start_Loop:
+    dec %cl
+    js .L__End_Loop
+    divsd %xmm7,%xmm4       # xmm7 = (dx / w)
+    cvttsd2siq %xmm4,%rax
+    cvtsi2sdq %rax,%xmm4    # xmm4 = t = (double)((int)(dx / w))
+    mulsd %xmm7,%xmm4       # xmm4 = w*t
+    mulsd %xmm6,%xmm7       # w *= scale
+    subsd %xmm4,%xmm2       # xmm2 = dx -= w*t
+    movsd %xmm2,%xmm4       # xmm4 = dx
+    jmp .L__Start_Loop
+.L__End_Loop:
+    divsd %xmm7,%xmm4       # xmm7 = (dx / w)
+    cvttsd2siq %xmm4,%rax
+    cvtsi2sdq %rax,%xmm4    # xmm4 = t = (double)((int)(dx / w))
+    mulsd %xmm7,%xmm4       # xmm4 = w*t
+    subsd %xmm4,%xmm2       # xmm2 = dx -= w*t
+    comiss .L__Zero_64(%rip),%xmm0
+    jb .L__Negative
+.L__Positive:
+    cvtsd2ss %xmm2,%xmm0
+    ret
+.L__Negative:
+    movsd .L__MinusZero_64(%rip),%xmm0
+    subsd %xmm2,%xmm0
+    cvtsd2ss %xmm0,%xmm0
+    ret
+
+.align 16
+.L__Input_Is_Equal:
+    cmp $0x7FF,%rax
+    jz .L__Dividend_Is_Infinity
+    cmp $0x7FF,%r8
+    jz .L__InputIsNaNOrInf
+    movsd %xmm0,%xmm1
+    pand .L__sign_bit_32(%rip),%xmm1
+    movss .L__Zero_64(%rip),%xmm0
+    por %xmm1,%xmm0
+    ret
+
+.L__InputIsNaNOrInf:
+    comiss %xmm0,%xmm1
+    jp .L__InputIsNaN
+    ret
+.L__Divisor_Is_Zero:
+.L__InputIsNaN:
+    por .L__exp_mask_32(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+    por .L__QNaN_mask_32(%rip),%xmm0
+    ret
+
+#Case when x < y
+.L__ReturnImmediate:
+    #xmm0 contains the input and is the result
+    ret
+
+
+
+.align 32
+.L__sign_bit_32: .quad 0x8000000080000000 + .quad 0x0 +.L__exp_mask_64: .quad 0x7FF0000000000000 + .quad 0x0 +.L__exp_mask_32: .quad 0x000000007F800000 + .quad 0x0 +.L__27bit_andingmask_64: .quad 0xfffffffff8000000 + .quad 0 +.L__2p52_mask_64: .quad 0x4330000000000000 + .quad 0 +.L__One_64: .quad 0x3FF0000000000000 + .quad 0 +.L__Zero_64: .quad 0x0 + .quad 0 +.L__MinusZero_64: .quad 0x8000000000000000 + .quad 0 +.L__QNaN_mask_32: .quad 0x0000000000400000 + .quad 0 +.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF + .quad 0 +.L__2pminus24_decimal: .quad 0x3E70000000000000 + .quad 0 +
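Each pass of .L__Start_Loop above strips the largest multiple of w out of dx, where w starts as dy scaled up by 2^(24*ntimes) and shrinks by 2^-24 per iteration; because the quotient dx/w then fits in about 24 bits, the truncation, the product w*t, and the subtraction are all exact in double precision. A sketch of a single reduction step in C (illustrative, assuming the scale bookkeeping shown above):

    /* One fmodf reduction step: remove the largest multiple of w
       from dx. Mirrors divsd + cvttsd2siq + cvtsi2sdq + mulsd + subsd. */
    static double reduce_step(double dx, double w) {
        double t = (double)(long long)(dx / w);  /* truncate toward zero */
        return dx - w * t;
    }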
diff --git a/src/gas/log.S b/src/gas/log.S new file mode 100644 index 0000000..7068c6d --- /dev/null +++ b/src/gas/log.S
@@ -0,0 +1,1155 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log.S
+#
+# An implementation of the log libm function.
+#
+# Prototype:
+#
+#     double log(double x);
+#
+
+#
+#   Algorithm:
+#
+#   Based on:
+#   Ping-Tak Peter Tang
+#   "Table-driven implementation of the logarithm function in IEEE
+#   floating-point arithmetic"
+#   ACM Transactions on Mathematical Software (TOMS)
+#   Volume 16, Issue 4 (December 1990)
+#
+#
+#   x very close to 1.0 is handled differently; for all other x
+#   a brief explanation is given below
+#
+#   x = (2^m)*A
+#   x = (2^m)*(G+g) with (1 <= G < 2) and (g <= 2^(-9))
+#   x = (2^m)*2*(G/2+g/2)
+#   x = (2^m)*2*(F+f) with (0.5 <= F < 1) and (f <= 2^(-10))
+#
+#   Y = (2^(-1))*(2^(-m))*(2^m)*A
+#   Now, range of Y is: 0.5 <= Y < 1
+#
+#   F = 0x100 + (first 8 mantissa bits) + (9th mantissa bit)
+#   Now, range of F is: 256 <= F <= 512
+#   F = F / 512
+#   Now, range of F is: 0.5 <= F <= 1
+#
+#   f = -(Y-F), with (f <= 2^(-10))
+#
+#   log(x) = m*log(2) + log(2) + log(F-f)
+#   log(x) = m*log(2) + log(2) + log(F) + log(1-(f/F))
+#   log(x) = m*log(2) + log(2*F) + log(1-r)
+#
+#   r = (f/F), with (r <= 2^(-9))
+#   r = f*(1/F) with (1/F) precomputed to avoid division
+#
+#   log(x) = m*log(2) + log(G) - poly
+#
+#   log(G) is precomputed
+#   poly = r + (r^2)/2 + (r^3)/3 + (r^4)/4 + (r^5)/5 + (r^6)/6
+#
+#   log(2) and log(G) need to be maintained in extra precision
+#   to avoid losing precision in the calculations
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log)
+#define fname_special _log_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+    sub $stack_size, %rsp
+
+    # compute exponent part
+    xor %rax, %rax
+    movdqa %xmm0, %xmm3
+    movsd %xmm0, %xmm4
+    psrlq $52, %xmm3
+    movd %xmm0, %rax
+    psubq .L__mask_1023(%rip), %xmm3
+    movdqa %xmm0, %xmm2
+    cvtdq2pd %xmm3, %xmm6 # xexp
+
+    # NaN or inf
+    movdqa %xmm0, %xmm5
+    andpd .L__real_inf(%rip), %xmm5
+    comisd .L__real_inf(%rip), %xmm5
+    je .L__x_is_inf_or_nan
+
+    # check for negative numbers or zero
+    xorpd %xmm5, %xmm5
+    comisd %xmm5, %xmm0
+    jbe .L__x_is_zero_or_neg
+
+    pand .L__real_mant(%rip), %xmm2
+    subsd .L__real_one(%rip), %xmm4
+
+    comisd .L__mask_1023_f(%rip), %xmm6
+    je .L__denormal_adjust
+
+.L__continue_common:
+
+    # compute index into the log tables
+    mov %rax, %r9
+    and .L__mask_mant_all8(%rip), %rax
+    and .L__mask_mant9(%rip), %r9
+    shl $1, %r9
+    add %r9, %rax
+    mov %rax, p_temp(%rsp)
+
+    # near one codepath
+    andpd .L__real_notsign(%rip), %xmm4
+    comisd .L__real_threshold(%rip), %xmm4
+    jb .L__near_one
+
+    # F, Y
+    movsd p_temp(%rsp), %xmm1
+    shr $44, %rax
+    por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1 + lea .L__log_F_inv(%rip), %r9 + + # f = F - Y, r = f * inv + subsd %xmm2, %xmm1 + mulsd (%r9,%rax,8), %xmm1 + + movsd %xmm1, %xmm2 + movsd %xmm1, %xmm0 + lea .L__log_256_lead(%rip), %r9 + + # poly + movsd .L__real_1_over_6(%rip), %xmm3 + movsd .L__real_1_over_3(%rip), %xmm1 + mulsd %xmm2, %xmm3 + mulsd %xmm2, %xmm1 + mulsd %xmm2, %xmm0 + movsd %xmm0, %xmm4 + addsd .L__real_1_over_5(%rip), %xmm3 + addsd .L__real_1_over_2(%rip), %xmm1 + mulsd %xmm0, %xmm4 + mulsd %xmm2, %xmm3 + mulsd %xmm0, %xmm1 + addsd .L__real_1_over_4(%rip), %xmm3 + addsd %xmm2, %xmm1 + mulsd %xmm4, %xmm3 + addsd %xmm3, %xmm1 + + # m*log(2) + log(G) - poly + movsd .L__real_log2_tail(%rip), %xmm5 + mulsd %xmm6, %xmm5 + subsd %xmm1, %xmm5 + + movsd (%r9,%rax,8), %xmm0 + lea .L__log_256_tail(%rip), %rdx + movsd (%rdx,%rax,8), %xmm2 + addsd %xmm5, %xmm2 + + movsd .L__real_log2_lead(%rip), %xmm4 + mulsd %xmm6, %xmm4 + addsd %xmm4, %xmm0 + + addsd %xmm2, %xmm0 + + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__near_one: + + # r = x - 1.0 + movsd .L__real_two(%rip), %xmm2 + subsd .L__real_one(%rip), %xmm0 # r + + addsd %xmm0, %xmm2 + movsd %xmm0, %xmm1 + divsd %xmm2, %xmm1 # r/(2+r) = u/2 + + movsd .L__real_ca2(%rip), %xmm4 + movsd .L__real_ca4(%rip), %xmm5 + + movsd %xmm0, %xmm6 + mulsd %xmm1, %xmm6 # correction + + addsd %xmm1, %xmm1 # u + movsd %xmm1, %xmm2 + + mulsd %xmm1, %xmm2 # u^2 + + mulsd %xmm2, %xmm4 + mulsd %xmm2, %xmm5 + + addsd .L__real_ca1(%rip), %xmm4 + addsd .L__real_ca3(%rip), %xmm5 + + mulsd %xmm1, %xmm2 # u^3 + mulsd %xmm2, %xmm4 + + mulsd %xmm2, %xmm2 + mulsd %xmm1, %xmm2 # u^7 + mulsd %xmm2, %xmm5 + + addsd %xmm5, %xmm4 + subsd %xmm6, %xmm4 + addsd %xmm4, %xmm0 + + add $stack_size, %rsp + ret + +.L__denormal_adjust: + por .L__real_one(%rip), %xmm2 + subsd .L__real_one(%rip), %xmm2 + movsd %xmm2, %xmm5 + pand .L__real_mant(%rip), %xmm2 + movd %xmm2, %rax + psrlq $52, %xmm5 + psubd .L__mask_2045(%rip), %xmm5 + cvtdq2pd %xmm5, %xmm6 + jmp .L__continue_common + +.p2align 4,,15 +.L__x_is_zero_or_neg: + jne .L__x_is_neg + + movsd .L__real_ninf(%rip), %xmm1 + mov .L__flag_x_zero(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_neg: + + movsd .L__real_qnan(%rip), %xmm1 + mov .L__flag_x_neg(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_inf_or_nan: + + cmp .L__real_inf(%rip), %rax + je .L__finish + + cmp .L__real_ninf(%rip), %rax + je .L__x_is_neg + + mov .L__real_qnanbit(%rip), %r9 + and %rax, %r9 + jnz .L__finish + + or .L__real_qnanbit(%rip), %rax + movd %rax, %xmm1 + mov .L__flag_x_nan(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__finish: + add $stack_size, %rsp + ret + + +.data + +.align 16 + +# these codes and the ones in the corresponding .c file have to match +.L__flag_x_zero: .long 00000001 +.L__flag_x_neg: .long 00000002 +.L__flag_x_nan: .long 00000003 + +.align 16 + +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0000000000000000 +.L__real_inf: .quad 0x7ff0000000000000 # +inf + .quad 0x0000000000000000 +.L__real_qnan: .quad 0x7ff8000000000000 # qNaN + .quad 0x0000000000000000 +.L__real_qnanbit: .quad 0x0008000000000000 + .quad 0x0000000000000000 +.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000000000000000 +.L__mask_1023: .quad 0x00000000000003ff + .quad 0x0000000000000000 +.L__mask_001: .quad 0x0000000000000001 + .quad 0x0000000000000000 + +.L__mask_mant_all8: .quad 0x000ff00000000000 + .quad 0x0000000000000000 +.L__mask_mant9: .quad 
0x0000080000000000 + .quad 0x0000000000000000 + +.L__real_log2_lead: .quad 0x3fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x0000000000000000 +.L__real_log2_tail: .quad 0x3e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x0000000000000000 + +.L__real_two: .quad 0x4000000000000000 # 2 + .quad 0x0000000000000000 + +.L__real_one: .quad 0x3ff0000000000000 # 1 + .quad 0x0000000000000000 + +.L__real_half: .quad 0x3fe0000000000000 # 1/2 + .quad 0x0000000000000000 + +.L__mask_100: .quad 0x0000000000000100 + .quad 0x0000000000000000 + +.L__real_1_over_512: .quad 0x3f60000000000000 + .quad 0x0000000000000000 + +.L__real_1_over_2: .quad 0x3fe0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_3: .quad 0x3fd5555555555555 + .quad 0x0000000000000000 +.L__real_1_over_4: .quad 0x3fd0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_5: .quad 0x3fc999999999999a + .quad 0x0000000000000000 +.L__real_1_over_6: .quad 0x3fc5555555555555 + .quad 0x0000000000000000 + +.L__mask_1023_f: .quad 0x0c08ff80000000000 + .quad 0x0000000000000000 + +.L__mask_2045: .quad 0x00000000000007fd + .quad 0x0000000000000000 + +.L__real_threshold: .quad 0x3fb0000000000000 # .0625 + .quad 0x0000000000000000 + +.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit + .quad 0x0000000000000000 + +.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x0000000000000000 +.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x0000000000000000 +.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x0000000000000000 +.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x0000000000000000 + +.align 16 +.L__log_256_lead: + .quad 0x0000000000000000 + .quad 0x3f6ff00aa0000000 + .quad 0x3f7fe02a60000000 + .quad 0x3f87dc4750000000 + .quad 0x3f8fc0a8b0000000 + .quad 0x3f93cea440000000 + .quad 0x3f97b91b00000000 + .quad 0x3f9b9fc020000000 + .quad 0x3f9f829b00000000 + .quad 0x3fa1b0d980000000 + .quad 0x3fa39e87b0000000 + .quad 0x3fa58a5ba0000000 + .quad 0x3fa77458f0000000 + .quad 0x3fa95c8300000000 + .quad 0x3fab42dd70000000 + .quad 0x3fad276b80000000 + .quad 0x3faf0a30c0000000 + .quad 0x3fb0759830000000 + .quad 0x3fb16536e0000000 + .quad 0x3fb253f620000000 + .quad 0x3fb341d790000000 + .quad 0x3fb42edcb0000000 + .quad 0x3fb51b0730000000 + .quad 0x3fb60658a0000000 + .quad 0x3fb6f0d280000000 + .quad 0x3fb7da7660000000 + .quad 0x3fb8c345d0000000 + .quad 0x3fb9ab4240000000 + .quad 0x3fba926d30000000 + .quad 0x3fbb78c820000000 + .quad 0x3fbc5e5480000000 + .quad 0x3fbd4313d0000000 + .quad 0x3fbe270760000000 + .quad 0x3fbf0a30c0000000 + .quad 0x3fbfec9130000000 + .quad 0x3fc0671510000000 + .quad 0x3fc0d77e70000000 + .quad 0x3fc1478580000000 + .quad 0x3fc1b72ad0000000 + .quad 0x3fc2266f10000000 + .quad 0x3fc29552f0000000 + .quad 0x3fc303d710000000 + .quad 0x3fc371fc20000000 + .quad 0x3fc3dfc2b0000000 + .quad 0x3fc44d2b60000000 + .quad 0x3fc4ba36f0000000 + .quad 0x3fc526e5e0000000 + .quad 0x3fc59338d0000000 + .quad 0x3fc5ff3070000000 + .quad 0x3fc66acd40000000 + .quad 0x3fc6d60fe0000000 + .quad 0x3fc740f8f0000000 + .quad 0x3fc7ab8900000000 + .quad 0x3fc815c0a0000000 + .quad 0x3fc87fa060000000 + .quad 0x3fc8e928d0000000 + .quad 0x3fc9525a90000000 + .quad 0x3fc9bb3620000000 + .quad 0x3fca23bc10000000 + .quad 0x3fca8becf0000000 + .quad 0x3fcaf3c940000000 + .quad 0x3fcb5b5190000000 + .quad 0x3fcbc28670000000 + .quad 0x3fcc296850000000 + .quad 0x3fcc8ff7c0000000 + .quad 0x3fccf63540000000 + .quad 0x3fcd5c2160000000 + .quad 
0x3fcdc1bca0000000 + .quad 0x3fce270760000000 + .quad 0x3fce8c0250000000 + .quad 0x3fcef0adc0000000 + .quad 0x3fcf550a50000000 + .quad 0x3fcfb91860000000 + .quad 0x3fd00e6c40000000 + .quad 0x3fd0402590000000 + .quad 0x3fd071b850000000 + .quad 0x3fd0a324e0000000 + .quad 0x3fd0d46b50000000 + .quad 0x3fd1058bf0000000 + .quad 0x3fd1368700000000 + .quad 0x3fd1675ca0000000 + .quad 0x3fd1980d20000000 + .quad 0x3fd1c898c0000000 + .quad 0x3fd1f8ff90000000 + .quad 0x3fd22941f0000000 + .quad 0x3fd2596010000000 + .quad 0x3fd2895a10000000 + .quad 0x3fd2b93030000000 + .quad 0x3fd2e8e2b0000000 + .quad 0x3fd31871c0000000 + .quad 0x3fd347dd90000000 + .quad 0x3fd3772660000000 + .quad 0x3fd3a64c50000000 + .quad 0x3fd3d54fa0000000 + .quad 0x3fd4043080000000 + .quad 0x3fd432ef20000000 + .quad 0x3fd4618bc0000000 + .quad 0x3fd4900680000000 + .quad 0x3fd4be5f90000000 + .quad 0x3fd4ec9730000000 + .quad 0x3fd51aad80000000 + .quad 0x3fd548a2c0000000 + .quad 0x3fd5767710000000 + .quad 0x3fd5a42ab0000000 + .quad 0x3fd5d1bdb0000000 + .quad 0x3fd5ff3070000000 + .quad 0x3fd62c82f0000000 + .quad 0x3fd659b570000000 + .quad 0x3fd686c810000000 + .quad 0x3fd6b3bb20000000 + .quad 0x3fd6e08ea0000000 + .quad 0x3fd70d42e0000000 + .quad 0x3fd739d7f0000000 + .quad 0x3fd7664e10000000 + .quad 0x3fd792a550000000 + .quad 0x3fd7bede00000000 + .quad 0x3fd7eaf830000000 + .quad 0x3fd816f410000000 + .quad 0x3fd842d1d0000000 + .quad 0x3fd86e9190000000 + .quad 0x3fd89a3380000000 + .quad 0x3fd8c5b7c0000000 + .quad 0x3fd8f11e80000000 + .quad 0x3fd91c67e0000000 + .quad 0x3fd9479410000000 + .quad 0x3fd972a340000000 + .quad 0x3fd99d9580000000 + .quad 0x3fd9c86b00000000 + .quad 0x3fd9f323e0000000 + .quad 0x3fda1dc060000000 + .quad 0x3fda484090000000 + .quad 0x3fda72a490000000 + .quad 0x3fda9cec90000000 + .quad 0x3fdac718c0000000 + .quad 0x3fdaf12930000000 + .quad 0x3fdb1b1e00000000 + .quad 0x3fdb44f770000000 + .quad 0x3fdb6eb590000000 + .quad 0x3fdb985890000000 + .quad 0x3fdbc1e080000000 + .quad 0x3fdbeb4d90000000 + .quad 0x3fdc149ff0000000 + .quad 0x3fdc3dd7a0000000 + .quad 0x3fdc66f4e0000000 + .quad 0x3fdc8ff7c0000000 + .quad 0x3fdcb8e070000000 + .quad 0x3fdce1af00000000 + .quad 0x3fdd0a63a0000000 + .quad 0x3fdd32fe70000000 + .quad 0x3fdd5b7f90000000 + .quad 0x3fdd83e720000000 + .quad 0x3fddac3530000000 + .quad 0x3fddd46a00000000 + .quad 0x3fddfc8590000000 + .quad 0x3fde248810000000 + .quad 0x3fde4c71a0000000 + .quad 0x3fde744260000000 + .quad 0x3fde9bfa60000000 + .quad 0x3fdec399d0000000 + .quad 0x3fdeeb20c0000000 + .quad 0x3fdf128f50000000 + .quad 0x3fdf39e5b0000000 + .quad 0x3fdf6123f0000000 + .quad 0x3fdf884a30000000 + .quad 0x3fdfaf5880000000 + .quad 0x3fdfd64f20000000 + .quad 0x3fdffd2e00000000 + .quad 0x3fe011fab0000000 + .quad 0x3fe02552a0000000 + .quad 0x3fe0389ee0000000 + .quad 0x3fe04bdf90000000 + .quad 0x3fe05f14b0000000 + .quad 0x3fe0723e50000000 + .quad 0x3fe0855c80000000 + .quad 0x3fe0986f40000000 + .quad 0x3fe0ab76b0000000 + .quad 0x3fe0be72e0000000 + .quad 0x3fe0d163c0000000 + .quad 0x3fe0e44980000000 + .quad 0x3fe0f72410000000 + .quad 0x3fe109f390000000 + .quad 0x3fe11cb810000000 + .quad 0x3fe12f7190000000 + .quad 0x3fe1422020000000 + .quad 0x3fe154c3d0000000 + .quad 0x3fe1675ca0000000 + .quad 0x3fe179eab0000000 + .quad 0x3fe18c6e00000000 + .quad 0x3fe19ee6b0000000 + .quad 0x3fe1b154b0000000 + .quad 0x3fe1c3b810000000 + .quad 0x3fe1d610f0000000 + .quad 0x3fe1e85f50000000 + .quad 0x3fe1faa340000000 + .quad 0x3fe20cdcd0000000 + .quad 0x3fe21f0bf0000000 + .quad 0x3fe23130d0000000 + .quad 0x3fe2434b60000000 + .quad 
0x3fe2555bc0000000 + .quad 0x3fe2676200000000 + .quad 0x3fe2795e10000000 + .quad 0x3fe28b5000000000 + .quad 0x3fe29d37f0000000 + .quad 0x3fe2af15f0000000 + .quad 0x3fe2c0e9e0000000 + .quad 0x3fe2d2b400000000 + .quad 0x3fe2e47430000000 + .quad 0x3fe2f62a90000000 + .quad 0x3fe307d730000000 + .quad 0x3fe3197a00000000 + .quad 0x3fe32b1330000000 + .quad 0x3fe33ca2b0000000 + .quad 0x3fe34e2890000000 + .quad 0x3fe35fa4e0000000 + .quad 0x3fe37117b0000000 + .quad 0x3fe38280f0000000 + .quad 0x3fe393e0d0000000 + .quad 0x3fe3a53730000000 + .quad 0x3fe3b68440000000 + .quad 0x3fe3c7c7f0000000 + .quad 0x3fe3d90260000000 + .quad 0x3fe3ea3390000000 + .quad 0x3fe3fb5b80000000 + .quad 0x3fe40c7a40000000 + .quad 0x3fe41d8fe0000000 + .quad 0x3fe42e9c60000000 + .quad 0x3fe43f9fe0000000 + .quad 0x3fe4509a50000000 + .quad 0x3fe4618bc0000000 + .quad 0x3fe4727430000000 + .quad 0x3fe48353d0000000 + .quad 0x3fe4942a80000000 + .quad 0x3fe4a4f850000000 + .quad 0x3fe4b5bd60000000 + .quad 0x3fe4c679a0000000 + .quad 0x3fe4d72d30000000 + .quad 0x3fe4e7d810000000 + .quad 0x3fe4f87a30000000 + .quad 0x3fe50913c0000000 + .quad 0x3fe519a4c0000000 + .quad 0x3fe52a2d20000000 + .quad 0x3fe53aad00000000 + .quad 0x3fe54b2460000000 + .quad 0x3fe55b9350000000 + .quad 0x3fe56bf9d0000000 + .quad 0x3fe57c57f0000000 + .quad 0x3fe58cadb0000000 + .quad 0x3fe59cfb20000000 + .quad 0x3fe5ad4040000000 + .quad 0x3fe5bd7d30000000 + .quad 0x3fe5cdb1d0000000 + .quad 0x3fe5ddde50000000 + .quad 0x3fe5ee02a0000000 + .quad 0x3fe5fe1ed0000000 + .quad 0x3fe60e32f0000000 + .quad 0x3fe61e3ef0000000 + .quad 0x3fe62e42e0000000 + .quad 0x0000000000000000 + +.align 16 +.L__log_256_tail: + .quad 0x0000000000000000 + .quad 0x3db5885e0250435a + .quad 0x3de620cf11f86ed2 + .quad 0x3dff0214edba4a25 + .quad 0x3dbf807c79f3db4e + .quad 0x3dea352ba779a52b + .quad 0x3dff56c46aa49fd5 + .quad 0x3dfebe465fef5196 + .quad 0x3e0cf0660099f1f8 + .quad 0x3e1247b2ff85945d + .quad 0x3e13fd7abf5202b6 + .quad 0x3e1f91c9a918d51e + .quad 0x3e08cb73f118d3ca + .quad 0x3e1d91c7d6fad074 + .quad 0x3de1971bec28d14c + .quad 0x3e15b616a423c78a + .quad 0x3da162a6617cc971 + .quad 0x3e166391c4c06d29 + .quad 0x3e2d46f5c1d0c4b8 + .quad 0x3e2e14282df1f6d3 + .quad 0x3e186f47424a660d + .quad 0x3e2d4c8de077753e + .quad 0x3e2e0c307ed24f1c + .quad 0x3e226ea18763bdd3 + .quad 0x3e25cad69737c933 + .quad 0x3e2af62599088901 + .quad 0x3e18c66c83d6b2d0 + .quad 0x3e1880ceb36fb30f + .quad 0x3e2495aac6ca17a4 + .quad 0x3e2761db4210878c + .quad 0x3e2eb78e862bac2f + .quad 0x3e19b2cd75790dd9 + .quad 0x3e2c55e5cbd3d50f + .quad 0x3db162a6617cc971 + .quad 0x3dfdbeabaaa2e519 + .quad 0x3e1652cb7150c647 + .quad 0x3e39a11cb2cd2ee2 + .quad 0x3e219d0ab1a28813 + .quad 0x3e24bd9e80a41811 + .quad 0x3e3214b596faa3df + .quad 0x3e303fea46980bb8 + .quad 0x3e31c8ffa5fd28c7 + .quad 0x3dce8f743bcd96c5 + .quad 0x3dfd98c5395315c6 + .quad 0x3e3996fa3ccfa7b2 + .quad 0x3e1cd2af2ad13037 + .quad 0x3e1d0da1bd17200e + .quad 0x3e3330410ba68b75 + .quad 0x3df4f27a790e7c41 + .quad 0x3e13956a86f6ff1b + .quad 0x3e2c6748723551d9 + .quad 0x3e2500de9326cdfc + .quad 0x3e1086c848df1b59 + .quad 0x3e04357ead6836ff + .quad 0x3e24832442408024 + .quad 0x3e3d10da8154b13d + .quad 0x3e39e8ad68ec8260 + .quad 0x3e3cfbf706abaf18 + .quad 0x3e3fc56ac6326e23 + .quad 0x3e39105e3185cf21 + .quad 0x3e3d017fe5b19cc0 + .quad 0x3e3d1f6b48dd13fe + .quad 0x3e20b63358a7e73a + .quad 0x3e263063028c211c + .quad 0x3e2e6a6886b09760 + .quad 0x3e3c138bb891cd03 + .quad 0x3e369f7722b7221a + .quad 0x3df57d8fac1a628c + .quad 0x3e3c55e5cbd3d50f + .quad 0x3e1552d2ff48fe2e + .quad 
0x3e37b8b26ca431bc + .quad 0x3e292decdc1c5f6d + .quad 0x3e3abc7c551aaa8c + .quad 0x3e36b540731a354b + .quad 0x3e32d341036b89ef + .quad 0x3e4f9ab21a3a2e0f + .quad 0x3e239c871afb9fbd + .quad 0x3e3e6add2c81f640 + .quad 0x3e435c95aa313f41 + .quad 0x3e249d4582f6cc53 + .quad 0x3e47574c1c07398f + .quad 0x3e4ba846dece9e8d + .quad 0x3e16999fafbc68e7 + .quad 0x3e4c9145e51b0103 + .quad 0x3e479ef2cb44850a + .quad 0x3e0beec73de11275 + .quad 0x3e2ef4351af5a498 + .quad 0x3e45713a493b4a50 + .quad 0x3e45c23a61385992 + .quad 0x3e42a88309f57299 + .quad 0x3e4530faa9ac8ace + .quad 0x3e25fec2d792a758 + .quad 0x3e35a517a71cbcd7 + .quad 0x3e3707dc3e1cd9a3 + .quad 0x3e3a1a9f8ef43049 + .quad 0x3e4409d0276b3674 + .quad 0x3e20e2f613e85bd9 + .quad 0x3df0027433001e5f + .quad 0x3e35dde2836d3265 + .quad 0x3e2300134d7aaf04 + .quad 0x3e3cb7e0b42724f5 + .quad 0x3e2d6e93167e6308 + .quad 0x3e3d1569b1526adb + .quad 0x3e0e99fc338a1a41 + .quad 0x3e4eb01394a11b1c + .quad 0x3e04f27a790e7c41 + .quad 0x3e25ce3ca97b7af9 + .quad 0x3e281f0f940ed857 + .quad 0x3e4d36295d88857c + .quad 0x3e21aca1ec4af526 + .quad 0x3e445743c7182726 + .quad 0x3e23c491aead337e + .quad 0x3e3aef401a738931 + .quad 0x3e21cede76092a29 + .quad 0x3e4fba8f44f82bb4 + .quad 0x3e446f5f7f3c3e1a + .quad 0x3e47055f86c9674b + .quad 0x3e4b41a92b6b6e1a + .quad 0x3e443d162e927628 + .quad 0x3e4466174013f9b1 + .quad 0x3e3b05096ad69c62 + .quad 0x3e40b169150faa58 + .quad 0x3e3cd98b1df85da7 + .quad 0x3e468b507b0f8fa8 + .quad 0x3e48422df57499ba + .quad 0x3e11351586970274 + .quad 0x3e117e08acba92ee + .quad 0x3e26e04314dd0229 + .quad 0x3e497f3097e56d1a + .quad 0x3e3356e655901286 + .quad 0x3e0cb761457f94d6 + .quad 0x3e39af67a85a9dac + .quad 0x3e453410931a909f + .quad 0x3e22c587206058f5 + .quad 0x3e223bc358899c22 + .quad 0x3e4d7bf8b6d223cb + .quad 0x3e47991ec5197ddb + .quad 0x3e4a79e6bb3a9219 + .quad 0x3e3a4c43ed663ec5 + .quad 0x3e461b5a1484f438 + .quad 0x3e4b4e36f7ef0c3a + .quad 0x3e115f026acd0d1b + .quad 0x3e3f36b535cecf05 + .quad 0x3e2ffb7fbf3eb5c6 + .quad 0x3e3e6a6886b09760 + .quad 0x3e3135eb27f5bbc3 + .quad 0x3e470be7d6f6fa57 + .quad 0x3e4ce43cc84ab338 + .quad 0x3e4c01d7aac3bd91 + .quad 0x3e45c58d07961060 + .quad 0x3e3628bcf941456e + .quad 0x3e4c58b2a8461cd2 + .quad 0x3e33071282fb989a + .quad 0x3e420dab6a80f09c + .quad 0x3e44f8d84c397b1e + .quad 0x3e40d0ee08599e48 + .quad 0x3e1d68787e37da36 + .quad 0x3e366187d591bafc + .quad 0x3e22346600bae772 + .quad 0x3e390377d0d61b8e + .quad 0x3e4f5e0dd966b907 + .quad 0x3e49023cb79a00e2 + .quad 0x3e44e05158c28ad8 + .quad 0x3e3bfa7b08b18ae4 + .quad 0x3e4ef1e63db35f67 + .quad 0x3e0ec2ae39493d4f + .quad 0x3e40afe930ab2fa0 + .quad 0x3e225ff8a1810dd4 + .quad 0x3e469743fb1a71a5 + .quad 0x3e5f9cc676785571 + .quad 0x3e5b524da4cbf982 + .quad 0x3e5a4c8b381535b8 + .quad 0x3e5839be809caf2c + .quad 0x3e50968a1cb82c13 + .quad 0x3e5eae6a41723fb5 + .quad 0x3e5d9c29a380a4db + .quad 0x3e4094aa0ada625e + .quad 0x3e5973ad6fc108ca + .quad 0x3e4747322fdbab97 + .quad 0x3e593692fa9d4221 + .quad 0x3e5c5a992dfbc7d9 + .quad 0x3e4e1f33e102387a + .quad 0x3e464fbef14c048c + .quad 0x3e4490f513ca5e3b + .quad 0x3e37a6af4d4c799d + .quad 0x3e57574c1c07398f + .quad 0x3e57b133417f8c1c + .quad 0x3e5feb9e0c176514 + .quad 0x3e419f25bb3172f7 + .quad 0x3e45f68a7bbfb852 + .quad 0x3e5ee278497929f1 + .quad 0x3e5ccee006109d58 + .quad 0x3e5ce081a07bd8b3 + .quad 0x3e570e12981817b8 + .quad 0x3e292ab6d93503d0 + .quad 0x3e58cb7dd7c3b61e + .quad 0x3e4efafd0a0b78da + .quad 0x3e5e907267c4288e + .quad 0x3e5d31ef96780875 + .quad 0x3e23430dfcd2ad50 + .quad 0x3e344d88d75bc1f9 + .quad 
0x3e5bec0f055e04fc + .quad 0x3e5d85611590b9ad + .quad 0x3df320568e583229 + .quad 0x3e5a891d1772f538 + .quad 0x3e22edc9dabba74d + .quad 0x3e4b9009a1015086 + .quad 0x3e52a12a8c5b1a19 + .quad 0x3e3a7885f0fdac85 + .quad 0x3e5f4ffcd43ac691 + .quad 0x3e52243ae2640aad + .quad 0x3e546513299035d3 + .quad 0x3e5b39c3a62dd725 + .quad 0x3e5ba6dd40049f51 + .quad 0x3e451d1ed7177409 + .quad 0x3e5cb0f2fd7f5216 + .quad 0x3e3ab150cd4e2213 + .quad 0x3e5cfd7bf3193844 + .quad 0x3e53fff8455f1dbd + .quad 0x3e5fee640b905fc9 + .quad 0x3e54e2adf548084c + .quad 0x3e3b597adc1ecdd2 + .quad 0x3e4345bd096d3a75 + .quad 0x3e5101b9d2453c8b + .quad 0x3e508ce55cc8c979 + .quad 0x3e5bbf017e595f71 + .quad 0x3e37ce733bd393dc + .quad 0x3e233bb0a503f8a1 + .quad 0x3e30e2f613e85bd9 + .quad 0x3e5e67555a635b3c + .quad 0x3e2ea88df73d5e8b + .quad 0x3e3d17e03bda18a8 + .quad 0x3e5b607d76044f7e + .quad 0x3e52adc4e71bc2fc + .quad 0x3e5f99dc7362d1d9 + .quad 0x3e5473fa008e6a6a + .quad 0x3e2b75bb09cb0985 + .quad 0x3e5ea04dd10b9aba + .quad 0x3e5802d0d6979674 + .quad 0x3e174688ccd99094 + .quad 0x3e496f16abb9df22 + .quad 0x3e46e66df2aa374f + .quad 0x3e4e66525ea4550a + .quad 0x3e42d02f34f20cbd + .quad 0x3e46cfce65047188 + .quad 0x3e39b78c842d58b8 + .quad 0x3e4735e624c24bc9 + .quad 0x3e47eba1f7dd1adf + .quad 0x3e586b3e59f65355 + .quad 0x3e1ce38e637f1b4d + .quad 0x3e58d82ec919edc7 + .quad 0x3e4c52648ddcfa37 + .quad 0x3e52482ceae1ac12 + .quad 0x3e55a312311aba4f + .quad 0x3e411e236329f225 + .quad 0x3e5b48c8cd2f246c + .quad 0x3e6efa39ef35793c + .quad 0x0000000000000000 + +.align 16 +.L__log_F_inv: + .quad 0x4000000000000000 + .quad 0x3fffe01fe01fe020 + .quad 0x3fffc07f01fc07f0 + .quad 0x3fffa11caa01fa12 + .quad 0x3fff81f81f81f820 + .quad 0x3fff6310aca0dbb5 + .quad 0x3fff44659e4a4271 + .quad 0x3fff25f644230ab5 + .quad 0x3fff07c1f07c1f08 + .quad 0x3ffee9c7f8458e02 + .quad 0x3ffecc07b301ecc0 + .quad 0x3ffeae807aba01eb + .quad 0x3ffe9131abf0b767 + .quad 0x3ffe741aa59750e4 + .quad 0x3ffe573ac901e574 + .quad 0x3ffe3a9179dc1a73 + .quad 0x3ffe1e1e1e1e1e1e + .quad 0x3ffe01e01e01e01e + .quad 0x3ffde5d6e3f8868a + .quad 0x3ffdca01dca01dca + .quad 0x3ffdae6076b981db + .quad 0x3ffd92f2231e7f8a + .quad 0x3ffd77b654b82c34 + .quad 0x3ffd5cac807572b2 + .quad 0x3ffd41d41d41d41d + .quad 0x3ffd272ca3fc5b1a + .quad 0x3ffd0cb58f6ec074 + .quad 0x3ffcf26e5c44bfc6 + .quad 0x3ffcd85689039b0b + .quad 0x3ffcbe6d9601cbe7 + .quad 0x3ffca4b3055ee191 + .quad 0x3ffc8b265afb8a42 + .quad 0x3ffc71c71c71c71c + .quad 0x3ffc5894d10d4986 + .quad 0x3ffc3f8f01c3f8f0 + .quad 0x3ffc26b5392ea01c + .quad 0x3ffc0e070381c0e0 + .quad 0x3ffbf583ee868d8b + .quad 0x3ffbdd2b899406f7 + .quad 0x3ffbc4fd65883e7b + .quad 0x3ffbacf914c1bad0 + .quad 0x3ffb951e2b18ff23 + .quad 0x3ffb7d6c3dda338b + .quad 0x3ffb65e2e3beee05 + .quad 0x3ffb4e81b4e81b4f + .quad 0x3ffb37484ad806ce + .quad 0x3ffb2036406c80d9 + .quad 0x3ffb094b31d922a4 + .quad 0x3ffaf286bca1af28 + .quad 0x3ffadbe87f94905e + .quad 0x3ffac5701ac5701b + .quad 0x3ffaaf1d2f87ebfd + .quad 0x3ffa98ef606a63be + .quad 0x3ffa82e65130e159 + .quad 0x3ffa6d01a6d01a6d + .quad 0x3ffa574107688a4a + .quad 0x3ffa41a41a41a41a + .quad 0x3ffa2c2a87c51ca0 + .quad 0x3ffa16d3f97a4b02 + .quad 0x3ffa01a01a01a01a + .quad 0x3ff9ec8e951033d9 + .quad 0x3ff9d79f176b682d + .quad 0x3ff9c2d14ee4a102 + .quad 0x3ff9ae24ea5510da + .quad 0x3ff999999999999a + .quad 0x3ff9852f0d8ec0ff + .quad 0x3ff970e4f80cb872 + .quad 0x3ff95cbb0be377ae + .quad 0x3ff948b0fcd6e9e0 + .quad 0x3ff934c67f9b2ce6 + .quad 0x3ff920fb49d0e229 + .quad 0x3ff90d4f120190d5 + .quad 0x3ff8f9c18f9c18fa + .quad 
0x3ff8e6527af1373f + .quad 0x3ff8d3018d3018d3 + .quad 0x3ff8bfce8062ff3a + .quad 0x3ff8acb90f6bf3aa + .quad 0x3ff899c0f601899c + .quad 0x3ff886e5f0abb04a + .quad 0x3ff87427bcc092b9 + .quad 0x3ff8618618618618 + .quad 0x3ff84f00c2780614 + .quad 0x3ff83c977ab2bedd + .quad 0x3ff82a4a0182a4a0 + .quad 0x3ff8181818181818 + .quad 0x3ff8060180601806 + .quad 0x3ff7f405fd017f40 + .quad 0x3ff7e225515a4f1d + .quad 0x3ff7d05f417d05f4 + .quad 0x3ff7beb3922e017c + .quad 0x3ff7ad2208e0ecc3 + .quad 0x3ff79baa6bb6398b + .quad 0x3ff78a4c8178a4c8 + .quad 0x3ff77908119ac60d + .quad 0x3ff767dce434a9b1 + .quad 0x3ff756cac201756d + .quad 0x3ff745d1745d1746 + .quad 0x3ff734f0c541fe8d + .quad 0x3ff724287f46debc + .quad 0x3ff713786d9c7c09 + .quad 0x3ff702e05c0b8170 + .quad 0x3ff6f26016f26017 + .quad 0x3ff6e1f76b4337c7 + .quad 0x3ff6d1a62681c861 + .quad 0x3ff6c16c16c16c17 + .quad 0x3ff6b1490aa31a3d + .quad 0x3ff6a13cd1537290 + .quad 0x3ff691473a88d0c0 + .quad 0x3ff6816816816817 + .quad 0x3ff6719f3601671a + .quad 0x3ff661ec6a5122f9 + .quad 0x3ff6524f853b4aa3 + .quad 0x3ff642c8590b2164 + .quad 0x3ff63356b88ac0de + .quad 0x3ff623fa77016240 + .quad 0x3ff614b36831ae94 + .quad 0x3ff6058160581606 + .quad 0x3ff5f66434292dfc + .quad 0x3ff5e75bb8d015e7 + .quad 0x3ff5d867c3ece2a5 + .quad 0x3ff5c9882b931057 + .quad 0x3ff5babcc647fa91 + .quad 0x3ff5ac056b015ac0 + .quad 0x3ff59d61f123ccaa + .quad 0x3ff58ed2308158ed + .quad 0x3ff5805601580560 + .quad 0x3ff571ed3c506b3a + .quad 0x3ff56397ba7c52e2 + .quad 0x3ff5555555555555 + .quad 0x3ff54725e6bb82fe + .quad 0x3ff5390948f40feb + .quad 0x3ff52aff56a8054b + .quad 0x3ff51d07eae2f815 + .quad 0x3ff50f22e111c4c5 + .quad 0x3ff5015015015015 + .quad 0x3ff4f38f62dd4c9b + .quad 0x3ff4e5e0a72f0539 + .quad 0x3ff4d843bedc2c4c + .quad 0x3ff4cab88725af6e + .quad 0x3ff4bd3edda68fe1 + .quad 0x3ff4afd6a052bf5b + .quad 0x3ff4a27fad76014a + .quad 0x3ff49539e3b2d067 + .quad 0x3ff4880522014880 + .quad 0x3ff47ae147ae147b + .quad 0x3ff46dce34596066 + .quad 0x3ff460cbc7f5cf9a + .quad 0x3ff453d9e2c776ca + .quad 0x3ff446f86562d9fb + .quad 0x3ff43a2730abee4d + .quad 0x3ff42d6625d51f87 + .quad 0x3ff420b5265e5951 + .quad 0x3ff4141414141414 + .quad 0x3ff40782d10e6566 + .quad 0x3ff3fb013fb013fb + .quad 0x3ff3ee8f42a5af07 + .quad 0x3ff3e22cbce4a902 + .quad 0x3ff3d5d991aa75c6 + .quad 0x3ff3c995a47babe7 + .quad 0x3ff3bd60d9232955 + .quad 0x3ff3b13b13b13b14 + .quad 0x3ff3a524387ac822 + .quad 0x3ff3991c2c187f63 + .quad 0x3ff38d22d366088e + .quad 0x3ff3813813813814 + .quad 0x3ff3755bd1c945ee + .quad 0x3ff3698df3de0748 + .quad 0x3ff35dce5f9f2af8 + .quad 0x3ff3521cfb2b78c1 + .quad 0x3ff34679ace01346 + .quad 0x3ff33ae45b57bcb2 + .quad 0x3ff32f5ced6a1dfa + .quad 0x3ff323e34a2b10bf + .quad 0x3ff3187758e9ebb6 + .quad 0x3ff30d190130d190 + .quad 0x3ff301c82ac40260 + .quad 0x3ff2f684bda12f68 + .quad 0x3ff2eb4ea1fed14b + .quad 0x3ff2e025c04b8097 + .quad 0x3ff2d50a012d50a0 + .quad 0x3ff2c9fb4d812ca0 + .quad 0x3ff2bef98e5a3711 + .quad 0x3ff2b404ad012b40 + .quad 0x3ff2a91c92f3c105 + .quad 0x3ff29e4129e4129e + .quad 0x3ff293725bb804a5 + .quad 0x3ff288b01288b013 + .quad 0x3ff27dfa38a1ce4d + .quad 0x3ff27350b8812735 + .quad 0x3ff268b37cd60127 + .quad 0x3ff25e22708092f1 + .quad 0x3ff2539d7e9177b2 + .quad 0x3ff2492492492492 + .quad 0x3ff23eb79717605b + .quad 0x3ff23456789abcdf + .quad 0x3ff22a0122a0122a + .quad 0x3ff21fb78121fb78 + .quad 0x3ff21579804855e6 + .quad 0x3ff20b470c67c0d9 + .quad 0x3ff2012012012012 + .quad 0x3ff1f7047dc11f70 + .quad 0x3ff1ecf43c7fb84c + .quad 0x3ff1e2ef3b3fb874 + .quad 0x3ff1d8f5672e4abd + .quad 
0x3ff1cf06ada2811d + .quad 0x3ff1c522fc1ce059 + .quad 0x3ff1bb4a4046ed29 + .quad 0x3ff1b17c67f2bae3 + .quad 0x3ff1a7b9611a7b96 + .quad 0x3ff19e0119e0119e + .quad 0x3ff19453808ca29c + .quad 0x3ff18ab083902bdb + .quad 0x3ff1811811811812 + .quad 0x3ff1778a191bd684 + .quad 0x3ff16e0689427379 + .quad 0x3ff1648d50fc3201 + .quad 0x3ff15b1e5f75270d + .quad 0x3ff151b9a3fdd5c9 + .quad 0x3ff1485f0e0acd3b + .quad 0x3ff13f0e8d344724 + .quad 0x3ff135c81135c811 + .quad 0x3ff12c8b89edc0ac + .quad 0x3ff12358e75d3033 + .quad 0x3ff11a3019a74826 + .quad 0x3ff1111111111111 + .quad 0x3ff107fbbe011080 + .quad 0x3ff0fef010fef011 + .quad 0x3ff0f5edfab325a2 + .quad 0x3ff0ecf56be69c90 + .quad 0x3ff0e40655826011 + .quad 0x3ff0db20a88f4696 + .quad 0x3ff0d24456359e3a + .quad 0x3ff0c9714fbcda3b + .quad 0x3ff0c0a7868b4171 + .quad 0x3ff0b7e6ec259dc8 + .quad 0x3ff0af2f722eecb5 + .quad 0x3ff0a6810a6810a7 + .quad 0x3ff09ddba6af8360 + .quad 0x3ff0953f39010954 + .quad 0x3ff08cabb37565e2 + .quad 0x3ff0842108421084 + .quad 0x3ff07b9f29b8eae2 + .quad 0x3ff073260a47f7c6 + .quad 0x3ff06ab59c7912fb + .quad 0x3ff0624dd2f1a9fc + .quad 0x3ff059eea0727586 + .quad 0x3ff05197f7d73404 + .quad 0x3ff04949cc1664c5 + .quad 0x3ff0410410410410 + .quad 0x3ff038c6b78247fc + .quad 0x3ff03091b51f5e1a + .quad 0x3ff02864fc7729e9 + .quad 0x3ff0204081020408 + .quad 0x3ff0182436517a37 + .quad 0x3ff0101010101010 + .quad 0x3ff0080402010080 + .quad 0x3ff0000000000000 + .quad 0x0000000000000000 + +#endif
diff --git a/src/gas/log10.S b/src/gas/log10.S new file mode 100644 index 0000000..90522ef --- /dev/null +++ b/src/gas/log10.S
@@ -0,0 +1,1146 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log10.S
+#
+# An implementation of the log10 libm function.
+#
+# Prototype:
+#
+#     double log10(double x);
+#
+
+#
+#   Algorithm:
+#       Similar to the one presented in log.S
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log10)
+#define fname_special _log10_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+    sub $stack_size, %rsp
+
+    # compute exponent part
+    xor %rax, %rax
+    movdqa %xmm0, %xmm3
+    movsd %xmm0, %xmm4
+    psrlq $52, %xmm3
+    movd %xmm0, %rax
+    psubq .L__mask_1023(%rip), %xmm3
+    movdqa %xmm0, %xmm2
+    cvtdq2pd %xmm3, %xmm6 # xexp
+
+    # NaN or inf
+    movdqa %xmm0, %xmm5
+    andpd .L__real_inf(%rip), %xmm5
+    comisd .L__real_inf(%rip), %xmm5
+    je .L__x_is_inf_or_nan
+
+    # check for negative numbers or zero
+    xorpd %xmm5, %xmm5
+    comisd %xmm5, %xmm0
+    jbe .L__x_is_zero_or_neg
+
+    pand .L__real_mant(%rip), %xmm2
+    subsd .L__real_one(%rip), %xmm4
+
+    comisd .L__mask_1023_f(%rip), %xmm6
+    je .L__denormal_adjust
+
+.L__continue_common:
+
+    # compute index into the log tables
+    mov %rax, %r9
+    and .L__mask_mant_all8(%rip), %rax
+    and .L__mask_mant9(%rip), %r9
+    shl $1, %r9
+    add %r9, %rax
+    mov %rax, p_temp(%rsp)
+
+    # near one codepath
+    andpd .L__real_notsign(%rip), %xmm4
+    comisd .L__real_threshold(%rip), %xmm4
+    jb .L__near_one
+
+    # F, Y
+    movsd p_temp(%rsp), %xmm1
+    shr $44, %rax
+    por .L__real_half(%rip), %xmm2
+    por .L__real_half(%rip), %xmm1
+    lea .L__log_F_inv(%rip), %r9
+
+    # f = F - Y, r = f * inv
+    subsd %xmm2, %xmm1
+    mulsd (%r9,%rax,8), %xmm1
+
+    movsd %xmm1, %xmm2
+    movsd %xmm1, %xmm0
+    lea .L__log_256_lead(%rip), %r9
+
+    # poly
+    movsd .L__real_1_over_6(%rip), %xmm3
+    movsd .L__real_1_over_3(%rip), %xmm1
+    mulsd %xmm2, %xmm3
+    mulsd %xmm2, %xmm1
+    mulsd %xmm2, %xmm0
+    movsd %xmm0, %xmm4
+    addsd .L__real_1_over_5(%rip), %xmm3
+    addsd .L__real_1_over_2(%rip), %xmm1
+    mulsd %xmm0, %xmm4
+    mulsd %xmm2, %xmm3
+    mulsd %xmm0, %xmm1
+    addsd .L__real_1_over_4(%rip), %xmm3
+    addsd %xmm2, %xmm1
+    mulsd %xmm4, %xmm3
+    addsd %xmm3, %xmm1
+
+    mulsd .L__real_log10_e(%rip), %xmm1
+
+    # m*log10(2) + log10(G) - poly
+    movsd .L__real_log10_2_tail(%rip), %xmm5
+    mulsd %xmm6, %xmm5
+    subsd %xmm1, %xmm5
+
+    movsd (%r9,%rax,8), %xmm0
+    lea .L__log_256_tail(%rip), %rdx
+    movsd (%rdx,%rax,8), %xmm2
+    addsd %xmm5, %xmm2
+
+    movsd .L__real_log10_2_lead(%rip), %xmm4
+    mulsd %xmm6, %xmm4
+    addsd %xmm4, %xmm0
+
+    addsd %xmm2, %xmm0
+
+    add $stack_size, %rsp
+    ret
+
+.p2align 4,,15
+.L__near_one:
+
+    # r = x - 1.0
+    movsd .L__real_two(%rip), %xmm2
+    subsd .L__real_one(%rip), %xmm0 # r
+
+    addsd %xmm0, %xmm2
+    movsd
%xmm0, %xmm1 + divsd %xmm2, %xmm1 # r/(2+r) = u/2 + + movsd .L__real_ca2(%rip), %xmm4 + movsd .L__real_ca4(%rip), %xmm5 + + movsd %xmm0, %xmm6 + mulsd %xmm1, %xmm6 # correction + + addsd %xmm1, %xmm1 # u + movsd %xmm1, %xmm2 + + mulsd %xmm1, %xmm2 # u^2 + + mulsd %xmm2, %xmm4 + mulsd %xmm2, %xmm5 + + addsd .L__real_ca1(%rip), %xmm4 + addsd .L__real_ca3(%rip), %xmm5 + + mulsd %xmm1, %xmm2 # u^3 + mulsd %xmm2, %xmm4 + + mulsd %xmm2, %xmm2 + mulsd %xmm1, %xmm2 # u^7 + mulsd %xmm2, %xmm5 + + addsd %xmm5, %xmm4 + subsd %xmm6, %xmm4 + + movdqa %xmm0, %xmm3 + pand .L__mask_lower(%rip), %xmm3 + subsd %xmm3, %xmm0 + addsd %xmm0, %xmm4 + + movsd %xmm3, %xmm0 + movsd %xmm4, %xmm1 + + mulsd .L__real_log10_e_tail(%rip), %xmm4 + mulsd .L__real_log10_e_tail(%rip), %xmm0 + mulsd .L__real_log10_e_lead(%rip), %xmm1 + mulsd .L__real_log10_e_lead(%rip), %xmm3 + + addsd %xmm4, %xmm0 + addsd %xmm1, %xmm0 + addsd %xmm3, %xmm0 + + add $stack_size, %rsp + ret + +.L__denormal_adjust: + por .L__real_one(%rip), %xmm2 + subsd .L__real_one(%rip), %xmm2 + movsd %xmm2, %xmm5 + pand .L__real_mant(%rip), %xmm2 + movd %xmm2, %rax + psrlq $52, %xmm5 + psubd .L__mask_2045(%rip), %xmm5 + cvtdq2pd %xmm5, %xmm6 + jmp .L__continue_common + +.p2align 4,,15 +.L__x_is_zero_or_neg: + jne .L__x_is_neg + + movsd .L__real_ninf(%rip), %xmm1 + mov .L__flag_x_zero(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_neg: + + movsd .L__real_qnan(%rip), %xmm1 + mov .L__flag_x_neg(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_inf_or_nan: + + cmp .L__real_inf(%rip), %rax + je .L__finish + + cmp .L__real_ninf(%rip), %rax + je .L__x_is_neg + + mov .L__real_qnanbit(%rip), %r9 + and %rax, %r9 + jnz .L__finish + + or .L__real_qnanbit(%rip), %rax + movd %rax, %xmm1 + mov .L__flag_x_nan(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__finish: + add $stack_size, %rsp + ret + + +.data + +.align 16 + +# these codes and the ones in the corresponding .c file have to match +.L__flag_x_zero: .long 00000001 +.L__flag_x_neg: .long 00000002 +.L__flag_x_nan: .long 00000003 + +.align 16 + +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0000000000000000 +.L__real_inf: .quad 0x7ff0000000000000 # +inf + .quad 0x0000000000000000 +.L__real_qnan: .quad 0x7ff8000000000000 # qNaN + .quad 0x0000000000000000 +.L__real_qnanbit: .quad 0x0008000000000000 + .quad 0x0000000000000000 +.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000000000000000 +.L__mask_1023: .quad 0x00000000000003ff + .quad 0x0000000000000000 +.L__mask_001: .quad 0x0000000000000001 + .quad 0x0000000000000000 + +.L__mask_mant_all8: .quad 0x000ff00000000000 + .quad 0x0000000000000000 +.L__mask_mant9: .quad 0x0000080000000000 + .quad 0x0000000000000000 + +.L__real_log10_e: .quad 0x3fdbcb7b1526e50e + .quad 0x0000000000000000 + +.L__real_log10_e_lead: .quad 0x3fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01 + .quad 0x0000000000000000 +.L__real_log10_e_tail: .quad 0x3ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7 + .quad 0x0000000000000000 + +.L__real_log10_2_lead: .quad 0x3fd3441350000000 + .quad 0x0000000000000000 +.L__real_log10_2_tail: .quad 0x3e03ef3fde623e25 + .quad 0x0000000000000000 + + + + +.L__real_two: .quad 0x4000000000000000 # 2 + .quad 0x0000000000000000 + +.L__real_one: .quad 0x3ff0000000000000 # 1 + .quad 0x0000000000000000 + +.L__real_half: .quad 0x3fe0000000000000 # 1/2 + .quad 0x0000000000000000 + +.L__mask_100: .quad 0x0000000000000100 + .quad 0x0000000000000000 + 
+.L__real_1_over_512: .quad 0x3f60000000000000 + .quad 0x0000000000000000 + +.L__real_1_over_2: .quad 0x3fe0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_3: .quad 0x3fd5555555555555 + .quad 0x0000000000000000 +.L__real_1_over_4: .quad 0x3fd0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_5: .quad 0x3fc999999999999a + .quad 0x0000000000000000 +.L__real_1_over_6: .quad 0x3fc5555555555555 + .quad 0x0000000000000000 + +.L__mask_1023_f: .quad 0x0c08ff80000000000 + .quad 0x0000000000000000 + +.L__mask_2045: .quad 0x00000000000007fd + .quad 0x0000000000000000 + +.L__real_threshold: .quad 0x3fb0000000000000 # .0625 + .quad 0x0000000000000000 + +.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit + .quad 0x0000000000000000 + +.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x0000000000000000 +.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x0000000000000000 +.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x0000000000000000 +.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x0000000000000000 + +.L__mask_lower: .quad 0x0ffffffff00000000 + .quad 0x0000000000000000 + +.align 16 +.L__log_256_lead: + .quad 0x0000000000000000 + .quad 0x3f5bbd9e90000000 + .quad 0x3f6bafd470000000 + .quad 0x3f74b99560000000 + .quad 0x3f7b9476a0000000 + .quad 0x3f81344da0000000 + .quad 0x3f849b0850000000 + .quad 0x3f87fe71c0000000 + .quad 0x3f8b5e9080000000 + .quad 0x3f8ebb6af0000000 + .quad 0x3f910a83a0000000 + .quad 0x3f92b5b5e0000000 + .quad 0x3f945f4f50000000 + .quad 0x3f96075300000000 + .quad 0x3f97adc3d0000000 + .quad 0x3f9952a4f0000000 + .quad 0x3f9af5f920000000 + .quad 0x3f9c97c370000000 + .quad 0x3f9e3806a0000000 + .quad 0x3f9fd6c5b0000000 + .quad 0x3fa0ba01a0000000 + .quad 0x3fa187e120000000 + .quad 0x3fa25502c0000000 + .quad 0x3fa32167c0000000 + .quad 0x3fa3ed1190000000 + .quad 0x3fa4b80180000000 + .quad 0x3fa58238e0000000 + .quad 0x3fa64bb910000000 + .quad 0x3fa7148340000000 + .quad 0x3fa7dc98c0000000 + .quad 0x3fa8a3fad0000000 + .quad 0x3fa96aaac0000000 + .quad 0x3faa30a9d0000000 + .quad 0x3faaf5f920000000 + .quad 0x3fabba9a00000000 + .quad 0x3fac7e8d90000000 + .quad 0x3fad41d510000000 + .quad 0x3fae0471a0000000 + .quad 0x3faec66470000000 + .quad 0x3faf87aeb0000000 + .quad 0x3fb02428c0000000 + .quad 0x3fb08426f0000000 + .quad 0x3fb0e3d290000000 + .quad 0x3fb1432c30000000 + .quad 0x3fb1a23440000000 + .quad 0x3fb200eb60000000 + .quad 0x3fb25f5210000000 + .quad 0x3fb2bd68e0000000 + .quad 0x3fb31b3050000000 + .quad 0x3fb378a8e0000000 + .quad 0x3fb3d5d330000000 + .quad 0x3fb432afa0000000 + .quad 0x3fb48f3ed0000000 + .quad 0x3fb4eb8120000000 + .quad 0x3fb5477730000000 + .quad 0x3fb5a32160000000 + .quad 0x3fb5fe8040000000 + .quad 0x3fb6599440000000 + .quad 0x3fb6b45df0000000 + .quad 0x3fb70eddb0000000 + .quad 0x3fb7691400000000 + .quad 0x3fb7c30160000000 + .quad 0x3fb81ca630000000 + .quad 0x3fb8760300000000 + .quad 0x3fb8cf1830000000 + .quad 0x3fb927e640000000 + .quad 0x3fb9806d90000000 + .quad 0x3fb9d8aea0000000 + .quad 0x3fba30a9d0000000 + .quad 0x3fba885fa0000000 + .quad 0x3fbadfd070000000 + .quad 0x3fbb36fcb0000000 + .quad 0x3fbb8de4d0000000 + .quad 0x3fbbe48930000000 + .quad 0x3fbc3aea40000000 + .quad 0x3fbc910870000000 + .quad 0x3fbce6e410000000 + .quad 0x3fbd3c7da0000000 + .quad 0x3fbd91d580000000 + .quad 0x3fbde6ec00000000 + .quad 0x3fbe3bc1a0000000 + .quad 0x3fbe9056b0000000 + .quad 0x3fbee4aba0000000 + .quad 0x3fbf38c0c0000000 + .quad 0x3fbf8c9680000000 + .quad 
0x3fbfe02d30000000 + .quad 0x3fc019c2a0000000 + .quad 0x3fc0434f70000000 + .quad 0x3fc06cbd60000000 + .quad 0x3fc0960c80000000 + .quad 0x3fc0bf3d00000000 + .quad 0x3fc0e84f10000000 + .quad 0x3fc11142f0000000 + .quad 0x3fc13a18a0000000 + .quad 0x3fc162d080000000 + .quad 0x3fc18b6a90000000 + .quad 0x3fc1b3e710000000 + .quad 0x3fc1dc4630000000 + .quad 0x3fc2048810000000 + .quad 0x3fc22cace0000000 + .quad 0x3fc254b4d0000000 + .quad 0x3fc27c9ff0000000 + .quad 0x3fc2a46e80000000 + .quad 0x3fc2cc20b0000000 + .quad 0x3fc2f3b690000000 + .quad 0x3fc31b3050000000 + .quad 0x3fc3428e20000000 + .quad 0x3fc369d020000000 + .quad 0x3fc390f680000000 + .quad 0x3fc3b80160000000 + .quad 0x3fc3def0e0000000 + .quad 0x3fc405c530000000 + .quad 0x3fc42c7e70000000 + .quad 0x3fc4531cd0000000 + .quad 0x3fc479a070000000 + .quad 0x3fc4a00970000000 + .quad 0x3fc4c65800000000 + .quad 0x3fc4ec8c30000000 + .quad 0x3fc512a640000000 + .quad 0x3fc538a630000000 + .quad 0x3fc55e8c50000000 + .quad 0x3fc5845890000000 + .quad 0x3fc5aa0b40000000 + .quad 0x3fc5cfa470000000 + .quad 0x3fc5f52440000000 + .quad 0x3fc61a8ad0000000 + .quad 0x3fc63fd850000000 + .quad 0x3fc6650cd0000000 + .quad 0x3fc68a2880000000 + .quad 0x3fc6af2b80000000 + .quad 0x3fc6d415e0000000 + .quad 0x3fc6f8e7d0000000 + .quad 0x3fc71da170000000 + .quad 0x3fc74242e0000000 + .quad 0x3fc766cc40000000 + .quad 0x3fc78b3da0000000 + .quad 0x3fc7af9730000000 + .quad 0x3fc7d3d910000000 + .quad 0x3fc7f80350000000 + .quad 0x3fc81c1620000000 + .quad 0x3fc8401190000000 + .quad 0x3fc863f5c0000000 + .quad 0x3fc887c2e0000000 + .quad 0x3fc8ab7900000000 + .quad 0x3fc8cf1830000000 + .quad 0x3fc8f2a0a0000000 + .quad 0x3fc9161270000000 + .quad 0x3fc9396db0000000 + .quad 0x3fc95cb280000000 + .quad 0x3fc97fe100000000 + .quad 0x3fc9a2f950000000 + .quad 0x3fc9c5fb70000000 + .quad 0x3fc9e8e7b0000000 + .quad 0x3fca0bbdf0000000 + .quad 0x3fca2e7e80000000 + .quad 0x3fca512960000000 + .quad 0x3fca73bea0000000 + .quad 0x3fca963e70000000 + .quad 0x3fcab8a8f0000000 + .quad 0x3fcadafe20000000 + .quad 0x3fcafd3e30000000 + .quad 0x3fcb1f6930000000 + .quad 0x3fcb417f40000000 + .quad 0x3fcb638070000000 + .quad 0x3fcb856cf0000000 + .quad 0x3fcba744b0000000 + .quad 0x3fcbc907f0000000 + .quad 0x3fcbeab6c0000000 + .quad 0x3fcc0c5130000000 + .quad 0x3fcc2dd750000000 + .quad 0x3fcc4f4950000000 + .quad 0x3fcc70a740000000 + .quad 0x3fcc91f130000000 + .quad 0x3fccb32740000000 + .quad 0x3fccd44980000000 + .quad 0x3fccf55810000000 + .quad 0x3fcd165300000000 + .quad 0x3fcd373a60000000 + .quad 0x3fcd580e60000000 + .quad 0x3fcd78cf00000000 + .quad 0x3fcd997c70000000 + .quad 0x3fcdba16a0000000 + .quad 0x3fcdda9dd0000000 + .quad 0x3fcdfb11f0000000 + .quad 0x3fce1b7330000000 + .quad 0x3fce3bc1a0000000 + .quad 0x3fce5bfd50000000 + .quad 0x3fce7c2660000000 + .quad 0x3fce9c3ce0000000 + .quad 0x3fcebc40e0000000 + .quad 0x3fcedc3280000000 + .quad 0x3fcefc11d0000000 + .quad 0x3fcf1bdee0000000 + .quad 0x3fcf3b99d0000000 + .quad 0x3fcf5b42a0000000 + .quad 0x3fcf7ad980000000 + .quad 0x3fcf9a5e70000000 + .quad 0x3fcfb9d190000000 + .quad 0x3fcfd932f0000000 + .quad 0x3fcff882a0000000 + .quad 0x3fd00be050000000 + .quad 0x3fd01b76a0000000 + .quad 0x3fd02b0430000000 + .quad 0x3fd03a8910000000 + .quad 0x3fd04a0540000000 + .quad 0x3fd05978e0000000 + .quad 0x3fd068e3f0000000 + .quad 0x3fd0784670000000 + .quad 0x3fd087a080000000 + .quad 0x3fd096f210000000 + .quad 0x3fd0a63b30000000 + .quad 0x3fd0b57bf0000000 + .quad 0x3fd0c4b450000000 + .quad 0x3fd0d3e460000000 + .quad 0x3fd0e30c30000000 + .quad 0x3fd0f22bc0000000 + .quad 
0x3fd1014310000000 + .quad 0x3fd1105240000000 + .quad 0x3fd11f5940000000 + .quad 0x3fd12e5830000000 + .quad 0x3fd13d4f00000000 + .quad 0x3fd14c3dd0000000 + .quad 0x3fd15b24a0000000 + .quad 0x3fd16a0370000000 + .quad 0x3fd178da50000000 + .quad 0x3fd187a940000000 + .quad 0x3fd1967060000000 + .quad 0x3fd1a52fa0000000 + .quad 0x3fd1b3e710000000 + .quad 0x3fd1c296c0000000 + .quad 0x3fd1d13eb0000000 + .quad 0x3fd1dfdef0000000 + .quad 0x3fd1ee7770000000 + .quad 0x3fd1fd0860000000 + .quad 0x3fd20b91a0000000 + .quad 0x3fd21a1350000000 + .quad 0x3fd2288d70000000 + .quad 0x3fd2370010000000 + .quad 0x3fd2456b30000000 + .quad 0x3fd253ced0000000 + .quad 0x3fd2622b00000000 + .quad 0x3fd2707fd0000000 + .quad 0x3fd27ecd40000000 + .quad 0x3fd28d1360000000 + .quad 0x3fd29b5220000000 + .quad 0x3fd2a989a0000000 + .quad 0x3fd2b7b9e0000000 + .quad 0x3fd2c5e2e0000000 + .quad 0x3fd2d404b0000000 + .quad 0x3fd2e21f50000000 + .quad 0x3fd2f032c0000000 + .quad 0x3fd2fe3f20000000 + .quad 0x3fd30c4470000000 + .quad 0x3fd31a42b0000000 + .quad 0x3fd32839e0000000 + .quad 0x3fd3362a10000000 + .quad 0x3fd3441350000000 + +.align 16 +.L__log_256_tail: + .quad 0x0000000000000000 + .quad 0x3db20abc22b2208f + .quad 0x3db10f69332e0dd4 + .quad 0x3dce950de87ed257 + .quad 0x3dd3f3443b626d69 + .quad 0x3df45aeaa5363e57 + .quad 0x3dc443683ce1bf0b + .quad 0x3df989cd60c6a511 + .quad 0x3dfd626f201f2e9f + .quad 0x3de94f8bb8dabdcd + .quad 0x3e0088d8ef423015 + .quad 0x3e080413a62b79ad + .quad 0x3e059717c0eed3c4 + .quad 0x3dad4a77add44902 + .quad 0x3e0e763ff037300e + .quad 0x3de162d74706f6c3 + .quad 0x3e0601cc1f4dbc14 + .quad 0x3deaf3e051f6e5bf + .quad 0x3e097a0b1e1af3eb + .quad 0x3dc0a38970c002c7 + .quad 0x3e102e000057c751 + .quad 0x3e155b00eecd6e0e + .quad 0x3ddf86297003b5af + .quad 0x3e1057b9b336a36d + .quad 0x3e134bc84a06ea4f + .quad 0x3e1643da9ea1bcad + .quad 0x3e1d66a7b4f7ea2a + .quad 0x3df6b2e038f7fcef + .quad 0x3df3e954c670f088 + .quad 0x3e047209093acab3 + .quad 0x3e1d708fe7275da7 + .quad 0x3e1fdf9e7771b9e7 + .quad 0x3e0827bfa70a0660 + .quad 0x3e1601cc1f4dbc14 + .quad 0x3e0637f6106a5e5b + .quad 0x3e126a13f17c624b + .quad 0x3e093eb2ce80623a + .quad 0x3e1430d1e91594de + .quad 0x3e1d6b10108fa031 + .quad 0x3e16879c0bbaf241 + .quad 0x3dff08015ea6bc2b + .quad 0x3e29b63dcdc6676c + .quad 0x3e2b022cbcc4ab2c + .quad 0x3df917d07ddd6544 + .quad 0x3e1540605703379e + .quad 0x3e0cd18b947a1b60 + .quad 0x3e17ad65277ca97e + .quad 0x3e11884dc59f5fa9 + .quad 0x3e1711c46006d082 + .quad 0x3e2f092e3c3108f8 + .quad 0x3e1714c5e32be13a + .quad 0x3e26bba7fd734f9a + .quad 0x3dfdf48fb5e08483 + .quad 0x3e232f9bc74d0b95 + .quad 0x3df973e848790c13 + .quad 0x3e1eccbc08c6586e + .quad 0x3e2115e9f9524a98 + .quad 0x3e2f1740593131b8 + .quad 0x3e1bcf8b25643835 + .quad 0x3e1f5fa81d8bed80 + .quad 0x3e244a4df929d9e4 + .quad 0x3e129820d8220c94 + .quad 0x3e2a0b489304e309 + .quad 0x3e1f4d56aba665fe + .quad 0x3e210c9019365163 + .quad 0x3df80f78fe592736 + .quad 0x3e10528825c81cca + .quad 0x3de095537d6d746a + .quad 0x3e1827bfa70a0660 + .quad 0x3e06b0a8ec45933c + .quad 0x3e105af81bf5dba9 + .quad 0x3e17e2fa2655d515 + .quad 0x3e0d59ecbfaee4bf + .quad 0x3e1d8b2fda683fa3 + .quad 0x3e24b8ddfd3a3737 + .quad 0x3e13827e61ae1204 + .quad 0x3e2c8c7b49e90f9f + .quad 0x3e29eaf01597591d + .quad 0x3e19aaa66e317b36 + .quad 0x3e2e725609720655 + .quad 0x3e261c33fc7aac54 + .quad 0x3e29662bcf61a252 + .quad 0x3e1843c811c42730 + .quad 0x3e2064bb0b5acb36 + .quad 0x3e0a340c842701a4 + .quad 0x3e1a8e55b58f79d6 + .quad 0x3de92d219c5e9d9a + .quad 0x3e3f63e60d7ffd6a + .quad 0x3e2e9b0ed9516314 + .quad 
0x3e2923901962350c + .quad 0x3e326f8838785e81 + .quad 0x3e3b5b6a4caba6af + .quad 0x3df0226adc8e761c + .quad 0x3e3c4ad7313a1aed + .quad 0x3e1564e87c738d17 + .quad 0x3e338fecf18a6618 + .quad 0x3e3d929ef5777666 + .quad 0x3e39483bf08da0b8 + .quad 0x3e3bdd0eeeaa5826 + .quad 0x3e39c4dd590237ba + .quad 0x3e1af3e9e0ebcac7 + .quad 0x3e35ce5382270dac + .quad 0x3e394f74532ab9ba + .quad 0x3e07342795888654 + .quad 0x3e0c5a000be34bf0 + .quad 0x3e2711c46006d082 + .quad 0x3e250025b4ed8cf8 + .quad 0x3e2ed18bcef2d2a0 + .quad 0x3e21282e0c0a7554 + .quad 0x3e0d70f33359a7ca + .quad 0x3e2b7f7e13a84025 + .quad 0x3e33306ec321891e + .quad 0x3e3fc7f8038b7550 + .quad 0x3e3eb0358cd71d64 + .quad 0x3e3a76c822859474 + .quad 0x3e3d0ec652de86e3 + .quad 0x3e2fa4cce08658af + .quad 0x3e3b84a2d2c00a9e + .quad 0x3e20a5b0f2c25bd1 + .quad 0x3e3dd660225bf699 + .quad 0x3e08b10f859bf037 + .quad 0x3e3e8823b590cbe1 + .quad 0x3e361311f31e96f6 + .quad 0x3e2e1f875ca20f9a + .quad 0x3e2c95724939b9a5 + .quad 0x3e3805957a3e58e2 + .quad 0x3e2ff126ea9f0334 + .quad 0x3e3953f5598e5609 + .quad 0x3e36c16ff856c448 + .quad 0x3e24cb220ff261f4 + .quad 0x3e35e120d53d53a2 + .quad 0x3e3a527f6189f256 + .quad 0x3e3856fcffd49c0f + .quad 0x3e300c2e8228d7da + .quad 0x3df113d09444dfe0 + .quad 0x3e2510630eea59a6 + .quad 0x3e262e780f32d711 + .quad 0x3ded3ed91a10f8cf + .quad 0x3e23654a7e4bcd85 + .quad 0x3e055b784980ad21 + .quad 0x3e212f2dd4b16e64 + .quad 0x3e37c4add939f50c + .quad 0x3e281784627180fc + .quad 0x3dea5162c7e14961 + .quad 0x3e310c9019365163 + .quad 0x3e373c4d2ba17688 + .quad 0x3e2ae8a5e0e93d81 + .quad 0x3e2ab0c6f01621af + .quad 0x3e301e8b74dd5b66 + .quad 0x3e2d206fecbb5494 + .quad 0x3df0b48b724fcc00 + .quad 0x3e3f831f0b61e229 + .quad 0x3df81a97c407bcaf + .quad 0x3e3e286c1ccbb7aa + .quad 0x3e28630b49220a93 + .quad 0x3dff0b15c1a22c5c + .quad 0x3e355445e71c0946 + .quad 0x3e3be630f8066d85 + .quad 0x3e2599dff0d96c39 + .quad 0x3e36cc85b18fb081 + .quad 0x3e34476d001ea8c8 + .quad 0x3e373f889e16d31f + .quad 0x3e3357100d792a87 + .quad 0x3e3bd179ae6101f6 + .quad 0x3e0ca31056c3f6e2 + .quad 0x3e3d2870629c08fb + .quad 0x3e3aba3880d2673f + .quad 0x3e2c3633cb297da6 + .quad 0x3e21843899efea02 + .quad 0x3e3bccc99d2008e6 + .quad 0x3e38000544bdd350 + .quad 0x3e2b91c226606ae1 + .quad 0x3e2a7adf26b62bdf + .quad 0x3e18764fc8826ec9 + .quad 0x3e1f4f3de50f68f0 + .quad 0x3df760ca757995e3 + .quad 0x3dfc667ed3805147 + .quad 0x3e3733f6196adf6f + .quad 0x3e2fb710f33e836b + .quad 0x3e39886eba641013 + .quad 0x3dfb5368d0af8c1a + .quad 0x3e358c691b8d2971 + .quad 0x3dfe9465226d08fb + .quad 0x3e33587e063f0097 + .quad 0x3e3618e702129f18 + .quad 0x3e361c33fc7aac54 + .quad 0x3e3f07a68408604a + .quad 0x3e3c34bfe4945421 + .quad 0x3e38b1f00e41300b + .quad 0x3e3f434284d61b63 + .quad 0x3e3a63095e397436 + .quad 0x3e34428656b919de + .quad 0x3e36ca9201b2d9a6 + .quad 0x3e2738823a2a931c + .quad 0x3e3c11880e179230 + .quad 0x3e313ddc8d6d52fe + .quad 0x3e33eed58922e917 + .quad 0x3e295992846bdd50 + .quad 0x3e0ddb4d5f2e278b + .quad 0x3df1a5f12a0635c4 + .quad 0x3e4642f0882c3c34 + .quad 0x3e2aee9ba7f6475e + .quad 0x3e264b7f834a60e4 + .quad 0x3e290d42e243792e + .quad 0x3e4c272008134f01 + .quad 0x3e4a782e16d6cf5b + .quad 0x3e44505c79da6648 + .quad 0x3e4ca9d4ea4dcd21 + .quad 0x3e297d3d627cd5bc + .quad 0x3e20b15cf9bcaa13 + .quad 0x3e315b2063cf76dd + .quad 0x3e2983e6f3aa2748 + .quad 0x3e3f4c64f4ffe994 + .quad 0x3e46beba7ce85a0f + .quad 0x3e3b9c69fd4ea6b8 + .quad 0x3e2b6aa5835fa4ab + .quad 0x3e43ccc3790fedd1 + .quad 0x3e29c04cc4404fe0 + .quad 0x3e40734b7a75d89d + .quad 0x3e1b4404c4e01612 + .quad 
0x3e40c565c2ce4894 + .quad 0x3e33c71441d935cd + .quad 0x3d72a492556b3b4e + .quad 0x3e20fa090341dc43 + .quad 0x3e2e8f7009e3d9f4 + .quad 0x3e4b1bf68b048a45 + .quad 0x3e3eee52dffaa956 + .quad 0x3e456b0900e465bd + .quad 0x3e4d929ef5777666 + .quad 0x3e486ea28637e260 + .quad 0x3e4665aff10ca2f0 + .quad 0x3e2f11fdaf48ec74 + .quad 0x3e4cbe1b86a4d1c7 + .quad 0x3e25b05bfea87665 + .quad 0x3e41cec20a1a4a1d + .quad 0x3e41cd5f0a409b9f + .quad 0x3e453656c8265070 + .quad 0x3e377ed835282260 + .quad 0x3e2417bc3040b9d2 + .quad 0x3e408eef7b79eff2 + .quad 0x3e4dc76f39dc57e9 + .quad 0x3e4c0493a70cf457 + .quad 0x3e4a83d6cea5a60c + .quad 0x3e30d6700dc557ba + .quad 0x3e44c96c12e8bd0a + .quad 0x3e3d2c1993e32315 + .quad 0x3e22c721135f8242 + .quad 0x3e279a3e4dda747d + .quad 0x3dfcf89f6941a72b + .quad 0x3e2149a702f10831 + .quad 0x3e4ead4b7c8175db + .quad 0x3e4e6930fe63e70a + .quad 0x3e41e106bed9ee2f + .quad 0x3e2d682b82f11c92 + .quad 0x3e3a07f188dba47c + .quad 0x3e40f9342dc172f6 + .quad 0x3e03ef3fde623e25 + +.align 16 +.L__log_F_inv: + .quad 0x4000000000000000 + .quad 0x3fffe01fe01fe020 + .quad 0x3fffc07f01fc07f0 + .quad 0x3fffa11caa01fa12 + .quad 0x3fff81f81f81f820 + .quad 0x3fff6310aca0dbb5 + .quad 0x3fff44659e4a4271 + .quad 0x3fff25f644230ab5 + .quad 0x3fff07c1f07c1f08 + .quad 0x3ffee9c7f8458e02 + .quad 0x3ffecc07b301ecc0 + .quad 0x3ffeae807aba01eb + .quad 0x3ffe9131abf0b767 + .quad 0x3ffe741aa59750e4 + .quad 0x3ffe573ac901e574 + .quad 0x3ffe3a9179dc1a73 + .quad 0x3ffe1e1e1e1e1e1e + .quad 0x3ffe01e01e01e01e + .quad 0x3ffde5d6e3f8868a + .quad 0x3ffdca01dca01dca + .quad 0x3ffdae6076b981db + .quad 0x3ffd92f2231e7f8a + .quad 0x3ffd77b654b82c34 + .quad 0x3ffd5cac807572b2 + .quad 0x3ffd41d41d41d41d + .quad 0x3ffd272ca3fc5b1a + .quad 0x3ffd0cb58f6ec074 + .quad 0x3ffcf26e5c44bfc6 + .quad 0x3ffcd85689039b0b + .quad 0x3ffcbe6d9601cbe7 + .quad 0x3ffca4b3055ee191 + .quad 0x3ffc8b265afb8a42 + .quad 0x3ffc71c71c71c71c + .quad 0x3ffc5894d10d4986 + .quad 0x3ffc3f8f01c3f8f0 + .quad 0x3ffc26b5392ea01c + .quad 0x3ffc0e070381c0e0 + .quad 0x3ffbf583ee868d8b + .quad 0x3ffbdd2b899406f7 + .quad 0x3ffbc4fd65883e7b + .quad 0x3ffbacf914c1bad0 + .quad 0x3ffb951e2b18ff23 + .quad 0x3ffb7d6c3dda338b + .quad 0x3ffb65e2e3beee05 + .quad 0x3ffb4e81b4e81b4f + .quad 0x3ffb37484ad806ce + .quad 0x3ffb2036406c80d9 + .quad 0x3ffb094b31d922a4 + .quad 0x3ffaf286bca1af28 + .quad 0x3ffadbe87f94905e + .quad 0x3ffac5701ac5701b + .quad 0x3ffaaf1d2f87ebfd + .quad 0x3ffa98ef606a63be + .quad 0x3ffa82e65130e159 + .quad 0x3ffa6d01a6d01a6d + .quad 0x3ffa574107688a4a + .quad 0x3ffa41a41a41a41a + .quad 0x3ffa2c2a87c51ca0 + .quad 0x3ffa16d3f97a4b02 + .quad 0x3ffa01a01a01a01a + .quad 0x3ff9ec8e951033d9 + .quad 0x3ff9d79f176b682d + .quad 0x3ff9c2d14ee4a102 + .quad 0x3ff9ae24ea5510da + .quad 0x3ff999999999999a + .quad 0x3ff9852f0d8ec0ff + .quad 0x3ff970e4f80cb872 + .quad 0x3ff95cbb0be377ae + .quad 0x3ff948b0fcd6e9e0 + .quad 0x3ff934c67f9b2ce6 + .quad 0x3ff920fb49d0e229 + .quad 0x3ff90d4f120190d5 + .quad 0x3ff8f9c18f9c18fa + .quad 0x3ff8e6527af1373f + .quad 0x3ff8d3018d3018d3 + .quad 0x3ff8bfce8062ff3a + .quad 0x3ff8acb90f6bf3aa + .quad 0x3ff899c0f601899c + .quad 0x3ff886e5f0abb04a + .quad 0x3ff87427bcc092b9 + .quad 0x3ff8618618618618 + .quad 0x3ff84f00c2780614 + .quad 0x3ff83c977ab2bedd + .quad 0x3ff82a4a0182a4a0 + .quad 0x3ff8181818181818 + .quad 0x3ff8060180601806 + .quad 0x3ff7f405fd017f40 + .quad 0x3ff7e225515a4f1d + .quad 0x3ff7d05f417d05f4 + .quad 0x3ff7beb3922e017c + .quad 0x3ff7ad2208e0ecc3 + .quad 0x3ff79baa6bb6398b + .quad 0x3ff78a4c8178a4c8 + .quad 
0x3ff77908119ac60d + .quad 0x3ff767dce434a9b1 + .quad 0x3ff756cac201756d + .quad 0x3ff745d1745d1746 + .quad 0x3ff734f0c541fe8d + .quad 0x3ff724287f46debc + .quad 0x3ff713786d9c7c09 + .quad 0x3ff702e05c0b8170 + .quad 0x3ff6f26016f26017 + .quad 0x3ff6e1f76b4337c7 + .quad 0x3ff6d1a62681c861 + .quad 0x3ff6c16c16c16c17 + .quad 0x3ff6b1490aa31a3d + .quad 0x3ff6a13cd1537290 + .quad 0x3ff691473a88d0c0 + .quad 0x3ff6816816816817 + .quad 0x3ff6719f3601671a + .quad 0x3ff661ec6a5122f9 + .quad 0x3ff6524f853b4aa3 + .quad 0x3ff642c8590b2164 + .quad 0x3ff63356b88ac0de + .quad 0x3ff623fa77016240 + .quad 0x3ff614b36831ae94 + .quad 0x3ff6058160581606 + .quad 0x3ff5f66434292dfc + .quad 0x3ff5e75bb8d015e7 + .quad 0x3ff5d867c3ece2a5 + .quad 0x3ff5c9882b931057 + .quad 0x3ff5babcc647fa91 + .quad 0x3ff5ac056b015ac0 + .quad 0x3ff59d61f123ccaa + .quad 0x3ff58ed2308158ed + .quad 0x3ff5805601580560 + .quad 0x3ff571ed3c506b3a + .quad 0x3ff56397ba7c52e2 + .quad 0x3ff5555555555555 + .quad 0x3ff54725e6bb82fe + .quad 0x3ff5390948f40feb + .quad 0x3ff52aff56a8054b + .quad 0x3ff51d07eae2f815 + .quad 0x3ff50f22e111c4c5 + .quad 0x3ff5015015015015 + .quad 0x3ff4f38f62dd4c9b + .quad 0x3ff4e5e0a72f0539 + .quad 0x3ff4d843bedc2c4c + .quad 0x3ff4cab88725af6e + .quad 0x3ff4bd3edda68fe1 + .quad 0x3ff4afd6a052bf5b + .quad 0x3ff4a27fad76014a + .quad 0x3ff49539e3b2d067 + .quad 0x3ff4880522014880 + .quad 0x3ff47ae147ae147b + .quad 0x3ff46dce34596066 + .quad 0x3ff460cbc7f5cf9a + .quad 0x3ff453d9e2c776ca + .quad 0x3ff446f86562d9fb + .quad 0x3ff43a2730abee4d + .quad 0x3ff42d6625d51f87 + .quad 0x3ff420b5265e5951 + .quad 0x3ff4141414141414 + .quad 0x3ff40782d10e6566 + .quad 0x3ff3fb013fb013fb + .quad 0x3ff3ee8f42a5af07 + .quad 0x3ff3e22cbce4a902 + .quad 0x3ff3d5d991aa75c6 + .quad 0x3ff3c995a47babe7 + .quad 0x3ff3bd60d9232955 + .quad 0x3ff3b13b13b13b14 + .quad 0x3ff3a524387ac822 + .quad 0x3ff3991c2c187f63 + .quad 0x3ff38d22d366088e + .quad 0x3ff3813813813814 + .quad 0x3ff3755bd1c945ee + .quad 0x3ff3698df3de0748 + .quad 0x3ff35dce5f9f2af8 + .quad 0x3ff3521cfb2b78c1 + .quad 0x3ff34679ace01346 + .quad 0x3ff33ae45b57bcb2 + .quad 0x3ff32f5ced6a1dfa + .quad 0x3ff323e34a2b10bf + .quad 0x3ff3187758e9ebb6 + .quad 0x3ff30d190130d190 + .quad 0x3ff301c82ac40260 + .quad 0x3ff2f684bda12f68 + .quad 0x3ff2eb4ea1fed14b + .quad 0x3ff2e025c04b8097 + .quad 0x3ff2d50a012d50a0 + .quad 0x3ff2c9fb4d812ca0 + .quad 0x3ff2bef98e5a3711 + .quad 0x3ff2b404ad012b40 + .quad 0x3ff2a91c92f3c105 + .quad 0x3ff29e4129e4129e + .quad 0x3ff293725bb804a5 + .quad 0x3ff288b01288b013 + .quad 0x3ff27dfa38a1ce4d + .quad 0x3ff27350b8812735 + .quad 0x3ff268b37cd60127 + .quad 0x3ff25e22708092f1 + .quad 0x3ff2539d7e9177b2 + .quad 0x3ff2492492492492 + .quad 0x3ff23eb79717605b + .quad 0x3ff23456789abcdf + .quad 0x3ff22a0122a0122a + .quad 0x3ff21fb78121fb78 + .quad 0x3ff21579804855e6 + .quad 0x3ff20b470c67c0d9 + .quad 0x3ff2012012012012 + .quad 0x3ff1f7047dc11f70 + .quad 0x3ff1ecf43c7fb84c + .quad 0x3ff1e2ef3b3fb874 + .quad 0x3ff1d8f5672e4abd + .quad 0x3ff1cf06ada2811d + .quad 0x3ff1c522fc1ce059 + .quad 0x3ff1bb4a4046ed29 + .quad 0x3ff1b17c67f2bae3 + .quad 0x3ff1a7b9611a7b96 + .quad 0x3ff19e0119e0119e + .quad 0x3ff19453808ca29c + .quad 0x3ff18ab083902bdb + .quad 0x3ff1811811811812 + .quad 0x3ff1778a191bd684 + .quad 0x3ff16e0689427379 + .quad 0x3ff1648d50fc3201 + .quad 0x3ff15b1e5f75270d + .quad 0x3ff151b9a3fdd5c9 + .quad 0x3ff1485f0e0acd3b + .quad 0x3ff13f0e8d344724 + .quad 0x3ff135c81135c811 + .quad 0x3ff12c8b89edc0ac + .quad 0x3ff12358e75d3033 + .quad 0x3ff11a3019a74826 + .quad 
0x3ff1111111111111 + .quad 0x3ff107fbbe011080 + .quad 0x3ff0fef010fef011 + .quad 0x3ff0f5edfab325a2 + .quad 0x3ff0ecf56be69c90 + .quad 0x3ff0e40655826011 + .quad 0x3ff0db20a88f4696 + .quad 0x3ff0d24456359e3a + .quad 0x3ff0c9714fbcda3b + .quad 0x3ff0c0a7868b4171 + .quad 0x3ff0b7e6ec259dc8 + .quad 0x3ff0af2f722eecb5 + .quad 0x3ff0a6810a6810a7 + .quad 0x3ff09ddba6af8360 + .quad 0x3ff0953f39010954 + .quad 0x3ff08cabb37565e2 + .quad 0x3ff0842108421084 + .quad 0x3ff07b9f29b8eae2 + .quad 0x3ff073260a47f7c6 + .quad 0x3ff06ab59c7912fb + .quad 0x3ff0624dd2f1a9fc + .quad 0x3ff059eea0727586 + .quad 0x3ff05197f7d73404 + .quad 0x3ff04949cc1664c5 + .quad 0x3ff0410410410410 + .quad 0x3ff038c6b78247fc + .quad 0x3ff03091b51f5e1a + .quad 0x3ff02864fc7729e9 + .quad 0x3ff0204081020408 + .quad 0x3ff0182436517a37 + .quad 0x3ff0101010101010 + .quad 0x3ff0080402010080 + .quad 0x3ff0000000000000 + .quad 0x0000000000000000 + +
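The near-one codepath of log10.S above (taken when |x - 1| < 0.0625, cf. .L__real_threshold) computes ln(x) through the series 2*atanh(u/2) = u + ca1*u^3 + ca2*u^5 + ca3*u^7 + ca4*u^9 with u = 2r/(2+r), r = x - 1, and then scales by log10(e) split into lead and tail parts so the final multiply loses almost no precision. The following is a minimal C sketch of that path, using the ca1..ca4 and log10e lead/tail constants from the tables above; the function name log10_near_one is ours, for illustration only, and is not part of this patch:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    static const double ca1 = 8.33333333333317923934e-02;   /* .L__real_ca1, ~1/12 */
    static const double ca2 = 1.25000000037717509602e-02;   /* .L__real_ca2, ~1/80 */
    static const double ca3 = 2.23213998791944806202e-03;   /* .L__real_ca3, ~1/448 */
    static const double ca4 = 4.34887777707614552256e-04;   /* .L__real_ca4, ~1/2304 */
    static const double log10e_lead = 4.34293746948242187500e-01;   /* .L__real_log10_e_lead */
    static const double log10e_tail = 7.3495500964015109100644e-07; /* .L__real_log10_e_tail */

    static double log10_near_one(double x)
    {
        double r = x - 1.0;
        double u_half = r / (2.0 + r);  /* u/2, cf. "r/(2+r) = u/2" above */
        double corr = r * u_half;       /* algebraically, r - corr == u */
        double u = u_half + u_half;
        double v = u * u;
        double u3 = u * v;
        /* r2 = u^3*(ca1 + v*ca2) + u^7*(ca3 + v*ca4) - correction, so that
           ln(x) == r + r2 up to the accuracy of the series */
        double r2 = u3 * (ca1 + v * ca2)
                  + (u3 * v * v) * (ca3 + v * ca4)
                  - corr;
        /* split r into a head with the low 32 mantissa bits cleared
           (cf. .L__mask_lower), fold the low part into r2, then multiply
           by log10(e) in two pieces, summing from small to large */
        uint64_t bits;
        memcpy(&bits, &r, sizeof bits);
        bits &= 0xffffffff00000000ULL;
        double r_head;
        memcpy(&r_head, &bits, sizeof r_head);
        double r_tail = (r - r_head) + r2;
        return (r_tail * log10e_tail + r_head * log10e_tail)
             + r_tail * log10e_lead
             + r_head * log10e_lead;
    }

    int main(void)
    {
        double x = 1.03;
        printf("%.17g vs libm %.17g\n", log10_near_one(x), log10(x));
        return 0;
    }

The correction term folds the difference between r and u into the small polynomial part, so the dominant term of the result is the exact r rather than the rounded quotient u; the head/tail splits then keep the product with log10(e) accurate to well below the final ulp.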
diff --git a/src/gas/log10f.S b/src/gas/log10f.S new file mode 100644 index 0000000..eb89c6c --- /dev/null +++ b/src/gas/log10f.S
@@ -0,0 +1,745 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# +# log10f.S +# +# An implementation of the log10f libm function. +# +# Prototype: +# +# float log10f(float x); +# + +# +# Algorithm: +# Similar to the one presented in log.S +# (a C sketch of the reduction follows this file's listing) +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(log10f) +#define fname_special _log10f_special@PLT + + +# local variable storage offsets +.equ p_temp, 0x0 +.equ stack_size, 0x18 + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + sub $stack_size, %rsp + + # compute exponent part + xor %eax, %eax + movdqa %xmm0, %xmm3 + movss %xmm0, %xmm4 + psrld $23, %xmm3 + movd %xmm0, %eax + psubd .L__mask_127(%rip), %xmm3 + movdqa %xmm0, %xmm2 + cvtdq2ps %xmm3, %xmm5 # xexp + + # NaN or inf + movdqa %xmm0, %xmm1 + andps .L__real_inf(%rip), %xmm1 + comiss .L__real_inf(%rip), %xmm1 + je .L__x_is_inf_or_nan + + # check for negative numbers or zero + xorps %xmm1, %xmm1 + comiss %xmm1, %xmm0 + jbe .L__x_is_zero_or_neg + + pand .L__real_mant(%rip), %xmm2 + subss .L__real_one(%rip), %xmm4 + + comiss .L__real_neg127(%rip), %xmm5 + je .L__denormal_adjust + +.L__continue_common: + + # compute index into the log tables + mov %eax, %r9d + and .L__mask_mant_all7(%rip), %eax + and .L__mask_mant8(%rip), %r9d + shl $1, %r9d + add %r9d, %eax + mov %eax, p_temp(%rsp) + + # near one codepath + andps .L__real_notsign(%rip), %xmm4 + comiss .L__real_threshold(%rip), %xmm4 + jb .L__near_one + + # F, Y + movss p_temp(%rsp), %xmm1 + shr $16, %eax + por .L__real_half(%rip), %xmm2 + por .L__real_half(%rip), %xmm1 + lea .L__log_F_inv(%rip), %r9 + + # f = F - Y, r = f * inv + subss %xmm2, %xmm1 + mulss (%r9,%rax,4), %xmm1 + + movss %xmm1, %xmm2 + movss %xmm1, %xmm0 + + # poly + mulss .L__real_1_over_3(%rip), %xmm2 + mulss %xmm1, %xmm0 + addss .L__real_1_over_2(%rip), %xmm2 + movss .L__real_log10_2_tail(%rip), %xmm3 + + lea .L__log_128_tail(%rip), %r9 + lea .L__log_128_lead(%rip), %r10 + + mulss %xmm0, %xmm2 + mulss %xmm5, %xmm3 + addss %xmm2, %xmm1 + + mulss .L__real_log10_e(%rip), %xmm1 + + # m*log10(2) + log10(G) - poly + movss .L__real_log10_2_lead(%rip), %xmm0 + subss %xmm1, %xmm3 # z2 + mulss %xmm5, %xmm0 + addss (%r9,%rax,4), %xmm3 + addss (%r10,%rax,4), %xmm0 + + addss %xmm3, %xmm0 + + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__near_one: + # r = x - 1.0 + movss .L__real_two(%rip), %xmm2 + subss .L__real_one(%rip), %xmm0 + + # u = r / (2.0 + r) + addss %xmm0, %xmm2 + movss %xmm0, %xmm1 + divss %xmm2, %xmm1 # u + + # correction = r * u + movss %xmm0, %xmm4 + mulss %xmm1, %xmm4 + + # u = u + u + addss %xmm1, %xmm1 + movss %xmm1, %xmm2 + mulss %xmm2, %xmm2 # v = u^2 + + # r2 = (u * v * (ca_1 + v * ca_2) - correction) + movss %xmm1, %xmm3 + mulss %xmm2, %xmm3 # u^3 + mulss .L__real_ca2(%rip), %xmm2 # Bu^2 + addss .L__real_ca1(%rip), %xmm2 # +A + mulss %xmm3, %xmm2 + subss %xmm4, %xmm2 # -correction + + movdqa %xmm0, %xmm5 + pand .L__mask_lower(%rip), %xmm5 + subss %xmm5, %xmm0 + addss %xmm0, %xmm2 + + movss %xmm5, %xmm0 + movss %xmm2, %xmm1 + + mulss .L__real_log10_e_tail(%rip), %xmm2 + mulss .L__real_log10_e_tail(%rip), %xmm0 + mulss .L__real_log10_e_lead(%rip), %xmm1 + mulss .L__real_log10_e_lead(%rip), %xmm5 + + addss %xmm2, %xmm0 + addss %xmm1, %xmm0 + addss %xmm5, %xmm0 + + add $stack_size, %rsp + ret + +.L__denormal_adjust: + por .L__real_one(%rip), %xmm2 + subss .L__real_one(%rip), %xmm2 + movdqa %xmm2, %xmm5 + pand .L__real_mant(%rip), %xmm2 + movd %xmm2, %eax + psrld $23, %xmm5 + psubd .L__mask_253(%rip), %xmm5 + cvtdq2ps %xmm5, %xmm5 + jmp .L__continue_common + +.p2align 4,,15 +.L__x_is_zero_or_neg: + jne .L__x_is_neg + + movss .L__real_ninf(%rip), %xmm1 + mov .L__flag_x_zero(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_neg: + + movss .L__real_nan(%rip), %xmm1 + mov .L__flag_x_neg(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_inf_or_nan: + + cmp .L__real_inf(%rip), %eax + je .L__finish + + cmp .L__real_ninf(%rip), %eax + je .L__x_is_neg + + mov .L__real_qnanbit(%rip), %r9d + and %eax, %r9d + jnz .L__finish + + or .L__real_qnanbit(%rip), %eax + movd %eax, %xmm1 + mov .L__flag_x_nan(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__finish: + add $stack_size, %rsp + ret + + +.data + +.align 16 + +# these codes and the ones in the corresponding .c file have to match +.L__flag_x_zero: .long 00000001 +.L__flag_x_neg: .long 00000002 +.L__flag_x_nan: .long 00000003 + +.align 16 + +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 2.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 +.L__real_neg_qnan: .quad 0x0ffc00000ffc00000 + .quad 0x0ffc00000ffc00000 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits + .quad 0x0007FFFFF007FFFFF +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f + +.L__mask_mant_all7: .quad 0x00000000007f0000 + .quad 0x00000000007f0000 +.L__mask_mant8: .quad 0x0000000000008000 + .quad 0x0000000000008000 + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD + +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 + +.L__real_log10_e_lead: .quad 0x3EDE00003EDE0000 # log10e_lead 0.4335937500 + .quad 0x3EDE00003EDE0000 +.L__real_log10_e_tail: .quad 0x3A37B1523A37B152 # log10e_tail 0.0007007319 + .quad 0x3A37B1523A37B152 + +.L__real_log10_2_lead: .quad
0x3e9a00003e9a0000 + .quad 0x0000000000000000 +.L__real_log10_2_tail: .quad 0x39826a1339826a13 + .quad 0x0000000000000000 +.L__real_log10_e: .quad 0x3ede5bd93ede5bd9 + .quad 0x0000000000000000 + +.L__mask_lower: .quad 0x0ffff0000ffff0000 + .quad 0x0ffff0000ffff0000 + +.align 16 + +.L__real_neg127: .long 0x0c2fe0000 + .long 0 + .quad 0 + +.L__mask_253: .long 0x000000fd + .long 0 + .quad 0 + +.L__real_threshold: .long 0x3d800000 + .long 0 + .quad 0 + +.L__mask_01: .long 0x00000001 + .long 0 + .quad 0 + +.L__mask_80: .long 0x00000080 + .long 0 + .quad 0 + +.L__real_3b800000: .long 0x3b800000 + .long 0 + .quad 0 + +.L__real_1_over_3: .long 0x3eaaaaab + .long 0 + .quad 0 + +.L__real_1_over_2: .long 0x3f000000 + .long 0 + .quad 0 + +.align 16 +.L__log_128_lead: + .long 0x00000000 + .long 0x3b5d4000 + .long 0x3bdc8000 + .long 0x3c24c000 + .long 0x3c5ac000 + .long 0x3c884000 + .long 0x3ca2c000 + .long 0x3cbd4000 + .long 0x3cd78000 + .long 0x3cf1c000 + .long 0x3d05c000 + .long 0x3d128000 + .long 0x3d1f4000 + .long 0x3d2c0000 + .long 0x3d388000 + .long 0x3d450000 + .long 0x3d518000 + .long 0x3d5dc000 + .long 0x3d6a0000 + .long 0x3d760000 + .long 0x3d810000 + .long 0x3d870000 + .long 0x3d8d0000 + .long 0x3d92c000 + .long 0x3d98c000 + .long 0x3d9e8000 + .long 0x3da44000 + .long 0x3daa0000 + .long 0x3dafc000 + .long 0x3db58000 + .long 0x3dbb4000 + .long 0x3dc0c000 + .long 0x3dc64000 + .long 0x3dcc0000 + .long 0x3dd18000 + .long 0x3dd6c000 + .long 0x3ddc4000 + .long 0x3de1c000 + .long 0x3de70000 + .long 0x3dec8000 + .long 0x3df1c000 + .long 0x3df70000 + .long 0x3dfc4000 + .long 0x3e00c000 + .long 0x3e034000 + .long 0x3e05c000 + .long 0x3e088000 + .long 0x3e0b0000 + .long 0x3e0d8000 + .long 0x3e100000 + .long 0x3e128000 + .long 0x3e150000 + .long 0x3e178000 + .long 0x3e1a0000 + .long 0x3e1c8000 + .long 0x3e1ec000 + .long 0x3e214000 + .long 0x3e23c000 + .long 0x3e260000 + .long 0x3e288000 + .long 0x3e2ac000 + .long 0x3e2d4000 + .long 0x3e2f8000 + .long 0x3e31c000 + .long 0x3e344000 + .long 0x3e368000 + .long 0x3e38c000 + .long 0x3e3b0000 + .long 0x3e3d4000 + .long 0x3e3fc000 + .long 0x3e420000 + .long 0x3e440000 + .long 0x3e464000 + .long 0x3e488000 + .long 0x3e4ac000 + .long 0x3e4d0000 + .long 0x3e4f4000 + .long 0x3e514000 + .long 0x3e538000 + .long 0x3e55c000 + .long 0x3e57c000 + .long 0x3e5a0000 + .long 0x3e5c0000 + .long 0x3e5e4000 + .long 0x3e604000 + .long 0x3e624000 + .long 0x3e648000 + .long 0x3e668000 + .long 0x3e688000 + .long 0x3e6ac000 + .long 0x3e6cc000 + .long 0x3e6ec000 + .long 0x3e70c000 + .long 0x3e72c000 + .long 0x3e74c000 + .long 0x3e76c000 + .long 0x3e78c000 + .long 0x3e7ac000 + .long 0x3e7cc000 + .long 0x3e7ec000 + .long 0x3e804000 + .long 0x3e814000 + .long 0x3e824000 + .long 0x3e834000 + .long 0x3e840000 + .long 0x3e850000 + .long 0x3e860000 + .long 0x3e870000 + .long 0x3e880000 + .long 0x3e88c000 + .long 0x3e89c000 + .long 0x3e8ac000 + .long 0x3e8bc000 + .long 0x3e8c8000 + .long 0x3e8d8000 + .long 0x3e8e8000 + .long 0x3e8f4000 + .long 0x3e904000 + .long 0x3e914000 + .long 0x3e920000 + .long 0x3e930000 + .long 0x3e93c000 + .long 0x3e94c000 + .long 0x3e958000 + .long 0x3e968000 + .long 0x3e978000 + .long 0x3e984000 + .long 0x3e994000 + .long 0x3e9a0000 + +.align 16 +.L__log_128_tail: + .long 0x00000000 + .long 0x367a8e44 + .long 0x368ed49f + .long 0x36c21451 + .long 0x375211d6 + .long 0x3720ea11 + .long 0x37e9eb59 + .long 0x37b87be7 + .long 0x37bf2560 + .long 0x33d597a0 + .long 0x37806a05 + .long 0x3820581f + .long 0x38223334 + .long 0x378e3bac + .long 0x3810684f + .long 0x37feb7ae + 
.long 0x36a9d609 + .long 0x37a68163 + .long 0x376a8b27 + .long 0x384c8fd6 + .long 0x3885183e + .long 0x3874a760 + .long 0x380d1154 + .long 0x38ea42bd + .long 0x384c1571 + .long 0x38ba66b8 + .long 0x38e7da3b + .long 0x38eee632 + .long 0x38d00911 + .long 0x388bbede + .long 0x378a0512 + .long 0x3894c7a0 + .long 0x38e30710 + .long 0x36db2829 + .long 0x3729d609 + .long 0x38fa0e82 + .long 0x38bc9a75 + .long 0x383a9297 + .long 0x38dc83c8 + .long 0x37eac335 + .long 0x38706ac3 + .long 0x389574c2 + .long 0x3892d068 + .long 0x38615032 + .long 0x3917acf4 + .long 0x3967a126 + .long 0x38217840 + .long 0x38b420ab + .long 0x38f9c7b2 + .long 0x391103bd + .long 0x39169a6b + .long 0x390dd194 + .long 0x38eda471 + .long 0x38a38950 + .long 0x37f6844a + .long 0x395e1cdb + .long 0x390fcffc + .long 0x38503e9d + .long 0x394b00fd + .long 0x38a9910a + .long 0x39518a31 + .long 0x3882d2c2 + .long 0x392488e4 + .long 0x397b0aff + .long 0x388a22d8 + .long 0x3902bd5e + .long 0x39342f85 + .long 0x39598811 + .long 0x3972e6b1 + .long 0x34d53654 + .long 0x360ca25e + .long 0x39785cc0 + .long 0x39630710 + .long 0x39424ed7 + .long 0x39165101 + .long 0x38be5421 + .long 0x37e7b0c0 + .long 0x394fd0c3 + .long 0x38efaaaa + .long 0x37a8f566 + .long 0x3927c744 + .long 0x383fa4d5 + .long 0x392d9e39 + .long 0x3803feae + .long 0x390a268c + .long 0x39692b80 + .long 0x38789b4f + .long 0x3909307d + .long 0x394a601c + .long 0x35e67edc + .long 0x383e386d + .long 0x38a7743d + .long 0x38dccec3 + .long 0x38ff57e0 + .long 0x39079d8b + .long 0x390651a6 + .long 0x38f7bad9 + .long 0x38d0ab82 + .long 0x38979e7d + .long 0x381978ee + .long 0x397816c8 + .long 0x39410cb2 + .long 0x39015384 + .long 0x3863fa28 + .long 0x39f41065 + .long 0x39c7668a + .long 0x39968afa + .long 0x39430db9 + .long 0x38a18cf3 + .long 0x39eb2907 + .long 0x39a9e10c + .long 0x39492800 + .long 0x385a53d1 + .long 0x39ce0cf7 + .long 0x3979c7b2 + .long 0x389f5d99 + .long 0x39ceefcb + .long 0x39646a39 + .long 0x380d7a9b + .long 0x39ad6650 + .long 0x390ac3b8 + .long 0x39d9a9a8 + .long 0x39548a99 + .long 0x39f73c4b + .long 0x3980960e + .long 0x374b3d5a + .long 0x39888f1e + .long 0x37679a07 + .long 0x39826a13 + +.align 16 +.L__log_F_inv: + .long 0x40000000 + .long 0x3ffe03f8 + .long 0x3ffc0fc1 + .long 0x3ffa232d + .long 0x3ff83e10 + .long 0x3ff6603e + .long 0x3ff4898d + .long 0x3ff2b9d6 + .long 0x3ff0f0f1 + .long 0x3fef2eb7 + .long 0x3fed7304 + .long 0x3febbdb3 + .long 0x3fea0ea1 + .long 0x3fe865ac + .long 0x3fe6c2b4 + .long 0x3fe52598 + .long 0x3fe38e39 + .long 0x3fe1fc78 + .long 0x3fe07038 + .long 0x3fdee95c + .long 0x3fdd67c9 + .long 0x3fdbeb62 + .long 0x3fda740e + .long 0x3fd901b2 + .long 0x3fd79436 + .long 0x3fd62b81 + .long 0x3fd4c77b + .long 0x3fd3680d + .long 0x3fd20d21 + .long 0x3fd0b6a0 + .long 0x3fcf6475 + .long 0x3fce168a + .long 0x3fcccccd + .long 0x3fcb8728 + .long 0x3fca4588 + .long 0x3fc907da + .long 0x3fc7ce0c + .long 0x3fc6980c + .long 0x3fc565c8 + .long 0x3fc43730 + .long 0x3fc30c31 + .long 0x3fc1e4bc + .long 0x3fc0c0c1 + .long 0x3fbfa030 + .long 0x3fbe82fa + .long 0x3fbd6910 + .long 0x3fbc5264 + .long 0x3fbb3ee7 + .long 0x3fba2e8c + .long 0x3fb92144 + .long 0x3fb81703 + .long 0x3fb70fbb + .long 0x3fb60b61 + .long 0x3fb509e7 + .long 0x3fb40b41 + .long 0x3fb30f63 + .long 0x3fb21643 + .long 0x3fb11fd4 + .long 0x3fb02c0b + .long 0x3faf3ade + .long 0x3fae4c41 + .long 0x3fad602b + .long 0x3fac7692 + .long 0x3fab8f6a + .long 0x3faaaaab + .long 0x3fa9c84a + .long 0x3fa8e83f + .long 0x3fa80a81 + .long 0x3fa72f05 + .long 0x3fa655c4 + .long 0x3fa57eb5 + .long 0x3fa4a9cf + .long 
0x3fa3d70a + .long 0x3fa3065e + .long 0x3fa237c3 + .long 0x3fa16b31 + .long 0x3fa0a0a1 + .long 0x3f9fd80a + .long 0x3f9f1166 + .long 0x3f9e4cad + .long 0x3f9d89d9 + .long 0x3f9cc8e1 + .long 0x3f9c09c1 + .long 0x3f9b4c70 + .long 0x3f9a90e8 + .long 0x3f99d723 + .long 0x3f991f1a + .long 0x3f9868c8 + .long 0x3f97b426 + .long 0x3f97012e + .long 0x3f964fda + .long 0x3f95a025 + .long 0x3f94f209 + .long 0x3f944581 + .long 0x3f939a86 + .long 0x3f92f114 + .long 0x3f924925 + .long 0x3f91a2b4 + .long 0x3f90fdbc + .long 0x3f905a38 + .long 0x3f8fb824 + .long 0x3f8f177a + .long 0x3f8e7835 + .long 0x3f8dda52 + .long 0x3f8d3dcb + .long 0x3f8ca29c + .long 0x3f8c08c1 + .long 0x3f8b7034 + .long 0x3f8ad8f3 + .long 0x3f8a42f8 + .long 0x3f89ae41 + .long 0x3f891ac7 + .long 0x3f888889 + .long 0x3f87f781 + .long 0x3f8767ab + .long 0x3f86d905 + .long 0x3f864b8a + .long 0x3f85bf37 + .long 0x3f853408 + .long 0x3f84a9fa + .long 0x3f842108 + .long 0x3f839930 + .long 0x3f83126f + .long 0x3f828cc0 + .long 0x3f820821 + .long 0x3f81848e + .long 0x3f810204 + .long 0x3f808081 + .long 0x3f800000 + +
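The main path of log10f above is the standard table-driven reduction: with x = 2^m * (1 + mant), the top seven mantissa bits, rounded via the eighth (cf. .L__mask_mant_all7 / .L__mask_mant8), select G = (128+j)/128, so that log10(x) = m*log10(2) + log10(G) - log10(e)*(r + r^2/2 + r^3/3), where r = (F - Y)/F is small, with F = G/2 and Y = (1 + mant)/2. Below is a minimal C sketch of just that index computation and reduction, as we read the code; log10f_reduce and the demo main are ours, not part of the patch, and the log10(G) lead/tail tables, the 1/F reciprocal table, and the zero/negative/denormal/inf/NaN handling are all elided:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Return r = (F - Y)/F and produce the exponent m and the table
       index j (0..128), mirroring the bit manipulation in log10f.S. */
    static float log10f_reduce(float x, int *m, int *j)
    {
        uint32_t ux;
        memcpy(&ux, &x, sizeof ux);
        *m = (int)(ux >> 23) - 127;               /* cf. .L__mask_127 */
        uint32_t idx = (ux & 0x007f0000u)         /* top 7 mantissa bits */
                     + ((ux & 0x00008000u) << 1); /* round on the next bit */
        *j = (int)(idx >> 16);                    /* cf. "shr $16, %eax" */
        /* Y and F live in [1/2, 1): built by OR-ing mantissa bits with
           the bits of 0.5f, as the asm does with "por .L__real_half" */
        uint32_t uy = (ux & 0x007fffffu) | 0x3f000000u;
        uint32_t uf = idx | 0x3f000000u;
        float Y, F;
        memcpy(&Y, &uy, sizeof Y);
        memcpy(&F, &uf, sizeof F);
        return (F - Y) / F; /* the asm multiplies by the .L__log_F_inv entry */
    }

    int main(void)
    {
        int m, j;
        float r = log10f_reduce(3.7f, &m, &j);
        /* log10f(x) ~= m*log10(2) + log10((128+j)/128)
                        - log10(e)*(r + r*r/2 + r*r*r/3),
           with log10(2), log10(G) and log10(e) each split into lead and
           tail parts (.L__real_log10_2_*, .L__log_128_*, .L__real_log10_e_*) */
        printf("m=%d j=%d r=%g\n", m, j, (double)r);
        return 0;
    }

log2.S below follows the same pattern at double precision: an eight-bit index rounded via .L__mask_mant9 into the 257-entry .L__log_256_lead/tail tables, a polynomial through r^6/6 multiplied by log2(e) (.L__real_log2_e), and the lead/tail split of log2(e) reserved for the near-one path.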
diff --git a/src/gas/log2.S b/src/gas/log2.S new file mode 100644 index 0000000..0c791b5 --- /dev/null +++ b/src/gas/log2.S
@@ -0,0 +1,1132 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# +# log2.S +# +# An implementation of the log2 libm function. +# +# Prototype: +# +# double log2(double x); +# + +# +# Algorithm: +# Similar to the one presented in log.S +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(log2) +#define fname_special _log2_special@PLT + + +# local variable storage offsets +.equ p_temp, 0x0 +.equ stack_size, 0x18 + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + sub $stack_size, %rsp + + # compute exponent part + xor %rax, %rax + movdqa %xmm0, %xmm3 + movsd %xmm0, %xmm4 + psrlq $52, %xmm3 + movd %xmm0, %rax + psubq .L__mask_1023(%rip), %xmm3 + movdqa %xmm0, %xmm2 + cvtdq2pd %xmm3, %xmm6 # xexp + + # NaN or inf + movdqa %xmm0, %xmm5 + andpd .L__real_inf(%rip), %xmm5 + comisd .L__real_inf(%rip), %xmm5 + je .L__x_is_inf_or_nan + + # check for negative numbers or zero + xorpd %xmm5, %xmm5 + comisd %xmm5, %xmm0 + jbe .L__x_is_zero_or_neg + + pand .L__real_mant(%rip), %xmm2 + subsd .L__real_one(%rip), %xmm4 + + comisd .L__mask_1023_f(%rip), %xmm6 + je .L__denormal_adjust + +.L__continue_common: + + # compute index into the log tables + mov %rax, %r9 + and .L__mask_mant_all8(%rip), %rax + and .L__mask_mant9(%rip), %r9 + shl $1, %r9 + add %r9, %rax + mov %rax, p_temp(%rsp) + + # near one codepath + andpd .L__real_notsign(%rip), %xmm4 + comisd .L__real_threshold(%rip), %xmm4 + jb .L__near_one + + # F, Y + movsd p_temp(%rsp), %xmm1 + shr $44, %rax + por .L__real_half(%rip), %xmm2 + por .L__real_half(%rip), %xmm1 + lea .L__log_F_inv(%rip), %r9 + + # f = F - Y, r = f * inv + subsd %xmm2, %xmm1 + mulsd (%r9,%rax,8), %xmm1 + + movsd %xmm1, %xmm2 + movsd %xmm1, %xmm0 + lea .L__log_256_lead(%rip), %r9 + + # poly + movsd .L__real_1_over_6(%rip), %xmm3 + movsd .L__real_1_over_3(%rip), %xmm1 + mulsd %xmm2, %xmm3 + mulsd %xmm2, %xmm1 + mulsd %xmm2, %xmm0 + movsd %xmm0, %xmm4 + addsd .L__real_1_over_5(%rip), %xmm3 + addsd .L__real_1_over_2(%rip), %xmm1 + mulsd %xmm0, %xmm4 + mulsd %xmm2, %xmm3 + mulsd %xmm0, %xmm1 + addsd .L__real_1_over_4(%rip), %xmm3 + addsd %xmm2, %xmm1 + mulsd %xmm4, %xmm3 + addsd %xmm3, %xmm1 + + mulsd .L__real_log2_e(%rip), %xmm1 + + # m + log2(G) - poly*log2_e + movsd (%r9,%rax,8), %xmm0 + lea .L__log_256_tail(%rip), %rdx + movsd (%rdx,%rax,8), %xmm2 + subsd %xmm1, %xmm2 + + addsd %xmm6, %xmm0 + addsd %xmm2, %xmm0 + + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__near_one: + + # r = x - 1.0 + movsd .L__real_two(%rip), %xmm2 + subsd .L__real_one(%rip), %xmm0 # r + + addsd %xmm0, %xmm2 + movsd %xmm0, %xmm1 + divsd %xmm2, %xmm1 # r/(2+r) = u/2 + + movsd .L__real_ca2(%rip), %xmm4 + movsd .L__real_ca4(%rip), %xmm5 + + movsd %xmm0, %xmm6 + mulsd %xmm1, %xmm6 #
correction + + addsd %xmm1, %xmm1 # u + movsd %xmm1, %xmm2 + + mulsd %xmm1, %xmm2 # u^2 + + mulsd %xmm2, %xmm4 + mulsd %xmm2, %xmm5 + + addsd .L__real_ca1(%rip), %xmm4 + addsd .L__real_ca3(%rip), %xmm5 + + mulsd %xmm1, %xmm2 # u^3 + mulsd %xmm2, %xmm4 + + mulsd %xmm2, %xmm2 + mulsd %xmm1, %xmm2 # u^7 + mulsd %xmm2, %xmm5 + + addsd %xmm5, %xmm4 + subsd %xmm6, %xmm4 + + movdqa %xmm0, %xmm3 + pand .L__mask_lower(%rip), %xmm3 + subsd %xmm3, %xmm0 + addsd %xmm0, %xmm4 + + movsd %xmm3, %xmm0 + movsd %xmm4, %xmm1 + + mulsd .L__real_log2_e_tail(%rip), %xmm4 + mulsd .L__real_log2_e_tail(%rip), %xmm0 + mulsd .L__real_log2_e_lead(%rip), %xmm1 + mulsd .L__real_log2_e_lead(%rip), %xmm3 + + addsd %xmm4, %xmm0 + addsd %xmm1, %xmm0 + addsd %xmm3, %xmm0 + + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__denormal_adjust: + por .L__real_one(%rip), %xmm2 + subsd .L__real_one(%rip), %xmm2 + movsd %xmm2, %xmm5 + pand .L__real_mant(%rip), %xmm2 + movd %xmm2, %rax + psrlq $52, %xmm5 + psubd .L__mask_2045(%rip), %xmm5 + cvtdq2pd %xmm5, %xmm6 + jmp .L__continue_common + +.p2align 4,,15 +.L__x_is_zero_or_neg: + jne .L__x_is_neg + + movsd .L__real_ninf(%rip), %xmm1 + mov .L__flag_x_zero(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_neg: + + movsd .L__real_qnan(%rip), %xmm1 + mov .L__flag_x_neg(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_inf_or_nan: + + cmp .L__real_inf(%rip), %rax + je .L__finish + + cmp .L__real_ninf(%rip), %rax + je .L__x_is_neg + + mov .L__real_qnanbit(%rip), %r9 + and %rax, %r9 + jnz .L__finish + + or .L__real_qnanbit(%rip), %rax + movd %rax, %xmm1 + mov .L__flag_x_nan(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__finish: + add $stack_size, %rsp + ret + + +.data + +.align 16 + +# these codes and the ones in the corresponding .c file have to match +.L__flag_x_zero: .long 00000001 +.L__flag_x_neg: .long 00000002 +.L__flag_x_nan: .long 00000003 + +.align 16 + +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0000000000000000 +.L__real_inf: .quad 0x7ff0000000000000 # +inf + .quad 0x0000000000000000 +.L__real_qnan: .quad 0x7ff8000000000000 # qNaN + .quad 0x0000000000000000 +.L__real_qnanbit: .quad 0x0008000000000000 + .quad 0x0000000000000000 +.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000000000000000 +.L__mask_1023: .quad 0x00000000000003ff + .quad 0x0000000000000000 +.L__mask_001: .quad 0x0000000000000001 + .quad 0x0000000000000000 + +.L__mask_mant_all8: .quad 0x000ff00000000000 + .quad 0x0000000000000000 +.L__mask_mant9: .quad 0x0000080000000000 + .quad 0x0000000000000000 + +.L__real_log2_e: .quad 0x3ff71547652b82fe + .quad 0x0000000000000000 + +.L__real_log2_e_lead: .quad 0x3ff7154400000000 # log2e_lead 1.44269180297851562500E+00 + .quad 0x0000000000000000 +.L__real_log2_e_tail: .quad 0x3ecb295c17f0bbbe # log2e_tail 3.23791044778235969970E-06 + .quad 0x0000000000000000 + +.L__real_two: .quad 0x4000000000000000 # 2 + .quad 0x0000000000000000 + +.L__real_one: .quad 0x3ff0000000000000 # 1 + .quad 0x0000000000000000 + +.L__real_half: .quad 0x3fe0000000000000 # 1/2 + .quad 0x0000000000000000 + +.L__mask_100: .quad 0x0000000000000100 + .quad 0x0000000000000000 + +.L__real_1_over_512: .quad 0x3f60000000000000 + .quad 0x0000000000000000 + +.L__real_1_over_2: .quad 0x3fe0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_3: .quad 0x3fd5555555555555 + .quad 0x0000000000000000 +.L__real_1_over_4: .quad 0x3fd0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_5: .quad 
0x3fc999999999999a + .quad 0x0000000000000000 +.L__real_1_over_6: .quad 0x3fc5555555555555 + .quad 0x0000000000000000 + +.L__mask_1023_f: .quad 0x0c08ff80000000000 + .quad 0x0000000000000000 + +.L__mask_2045: .quad 0x00000000000007fd + .quad 0x0000000000000000 + +.L__real_threshold: .quad 0x3fb0000000000000 # .0625 + .quad 0x0000000000000000 + +.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit + .quad 0x0000000000000000 + +.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x0000000000000000 +.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x0000000000000000 +.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x0000000000000000 +.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x0000000000000000 + +.L__mask_lower: .quad 0x0ffffffff00000000 + .quad 0x0000000000000000 + +.align 16 +.L__log_256_lead: + .quad 0x0000000000000000 + .quad 0x3f7709c460000000 + .quad 0x3f86fe50b0000000 + .quad 0x3f91363110000000 + .quad 0x3f96e79680000000 + .quad 0x3f9c9363b0000000 + .quad 0x3fa11cd1d0000000 + .quad 0x3fa3ed3090000000 + .quad 0x3fa6bad370000000 + .quad 0x3fa985bfc0000000 + .quad 0x3fac4dfab0000000 + .quad 0x3faf138980000000 + .quad 0x3fb0eb3890000000 + .quad 0x3fb24b5b70000000 + .quad 0x3fb3aa2fd0000000 + .quad 0x3fb507b830000000 + .quad 0x3fb663f6f0000000 + .quad 0x3fb7beee90000000 + .quad 0x3fb918a160000000 + .quad 0x3fba7111d0000000 + .quad 0x3fbbc84240000000 + .quad 0x3fbd1e34e0000000 + .quad 0x3fbe72ec10000000 + .quad 0x3fbfc66a00000000 + .quad 0x3fc08c5880000000 + .quad 0x3fc134e1b0000000 + .quad 0x3fc1dcd190000000 + .quad 0x3fc2842940000000 + .quad 0x3fc32ae9e0000000 + .quad 0x3fc3d11460000000 + .quad 0x3fc476a9f0000000 + .quad 0x3fc51bab90000000 + .quad 0x3fc5c01a30000000 + .quad 0x3fc663f6f0000000 + .quad 0x3fc70742d0000000 + .quad 0x3fc7a9fec0000000 + .quad 0x3fc84c2bd0000000 + .quad 0x3fc8edcae0000000 + .quad 0x3fc98edd00000000 + .quad 0x3fca2f6320000000 + .quad 0x3fcacf5e20000000 + .quad 0x3fcb6ecf10000000 + .quad 0x3fcc0db6c0000000 + .quad 0x3fccac1630000000 + .quad 0x3fcd49ee40000000 + .quad 0x3fcde73fe0000000 + .quad 0x3fce840be0000000 + .quad 0x3fcf205330000000 + .quad 0x3fcfbc16b0000000 + .quad 0x3fd02baba0000000 + .quad 0x3fd0790ad0000000 + .quad 0x3fd0c62970000000 + .quad 0x3fd11307d0000000 + .quad 0x3fd15fa670000000 + .quad 0x3fd1ac05b0000000 + .quad 0x3fd1f825f0000000 + .quad 0x3fd24407a0000000 + .quad 0x3fd28fab30000000 + .quad 0x3fd2db10f0000000 + .quad 0x3fd3263960000000 + .quad 0x3fd37124c0000000 + .quad 0x3fd3bbd3a0000000 + .quad 0x3fd4064630000000 + .quad 0x3fd4507cf0000000 + .quad 0x3fd49a7840000000 + .quad 0x3fd4e43880000000 + .quad 0x3fd52dbdf0000000 + .quad 0x3fd5770910000000 + .quad 0x3fd5c01a30000000 + .quad 0x3fd608f1b0000000 + .quad 0x3fd6518fe0000000 + .quad 0x3fd699f520000000 + .quad 0x3fd6e221c0000000 + .quad 0x3fd72a1630000000 + .quad 0x3fd771d2b0000000 + .quad 0x3fd7b957a0000000 + .quad 0x3fd800a560000000 + .quad 0x3fd847bc30000000 + .quad 0x3fd88e9c70000000 + .quad 0x3fd8d54670000000 + .quad 0x3fd91bba80000000 + .quad 0x3fd961f900000000 + .quad 0x3fd9a80230000000 + .quad 0x3fd9edd670000000 + .quad 0x3fda337600000000 + .quad 0x3fda78e140000000 + .quad 0x3fdabe1870000000 + .quad 0x3fdb031be0000000 + .quad 0x3fdb47ebf0000000 + .quad 0x3fdb8c88d0000000 + .quad 0x3fdbd0f2e0000000 + .quad 0x3fdc152a60000000 + .quad 0x3fdc592fa0000000 + .quad 0x3fdc9d02f0000000 + .quad 0x3fdce0a490000000 + .quad 0x3fdd2414c0000000 + .quad 0x3fdd6753e0000000 + .quad 
0x3fddaa6220000000 + .quad 0x3fdded3fd0000000 + .quad 0x3fde2fed30000000 + .quad 0x3fde726aa0000000 + .quad 0x3fdeb4b840000000 + .quad 0x3fdef6d670000000 + .quad 0x3fdf38c560000000 + .quad 0x3fdf7a8560000000 + .quad 0x3fdfbc16b0000000 + .quad 0x3fdffd7990000000 + .quad 0x3fe01f5720000000 + .quad 0x3fe03fda80000000 + .quad 0x3fe0604710000000 + .quad 0x3fe0809cf0000000 + .quad 0x3fe0a0dc30000000 + .quad 0x3fe0c10500000000 + .quad 0x3fe0e11770000000 + .quad 0x3fe10113b0000000 + .quad 0x3fe120f9d0000000 + .quad 0x3fe140c9f0000000 + .quad 0x3fe1608440000000 + .quad 0x3fe18028c0000000 + .quad 0x3fe19fb7b0000000 + .quad 0x3fe1bf3110000000 + .quad 0x3fe1de9510000000 + .quad 0x3fe1fde3d0000000 + .quad 0x3fe21d1d50000000 + .quad 0x3fe23c41d0000000 + .quad 0x3fe25b5150000000 + .quad 0x3fe27a4c00000000 + .quad 0x3fe29931f0000000 + .quad 0x3fe2b80340000000 + .quad 0x3fe2d6c010000000 + .quad 0x3fe2f56870000000 + .quad 0x3fe313fc80000000 + .quad 0x3fe3327c60000000 + .quad 0x3fe350e830000000 + .quad 0x3fe36f3ff0000000 + .quad 0x3fe38d83e0000000 + .quad 0x3fe3abb3f0000000 + .quad 0x3fe3c9d060000000 + .quad 0x3fe3e7d930000000 + .quad 0x3fe405ce80000000 + .quad 0x3fe423b070000000 + .quad 0x3fe4417f20000000 + .quad 0x3fe45f3a90000000 + .quad 0x3fe47ce2f0000000 + .quad 0x3fe49a7840000000 + .quad 0x3fe4b7fab0000000 + .quad 0x3fe4d56a50000000 + .quad 0x3fe4f2c740000000 + .quad 0x3fe5101180000000 + .quad 0x3fe52d4940000000 + .quad 0x3fe54a6e80000000 + .quad 0x3fe5678170000000 + .quad 0x3fe5848220000000 + .quad 0x3fe5a170a0000000 + .quad 0x3fe5be4d00000000 + .quad 0x3fe5db1770000000 + .quad 0x3fe5f7cff0000000 + .quad 0x3fe61476a0000000 + .quad 0x3fe6310b80000000 + .quad 0x3fe64d8ed0000000 + .quad 0x3fe66a0080000000 + .quad 0x3fe68660c0000000 + .quad 0x3fe6a2af90000000 + .quad 0x3fe6beed20000000 + .quad 0x3fe6db1960000000 + .quad 0x3fe6f73480000000 + .quad 0x3fe7133e90000000 + .quad 0x3fe72f37a0000000 + .quad 0x3fe74b1fd0000000 + .quad 0x3fe766f720000000 + .quad 0x3fe782bdb0000000 + .quad 0x3fe79e73a0000000 + .quad 0x3fe7ba18f0000000 + .quad 0x3fe7d5adc0000000 + .quad 0x3fe7f13220000000 + .quad 0x3fe80ca620000000 + .quad 0x3fe82809d0000000 + .quad 0x3fe8435d50000000 + .quad 0x3fe85ea0b0000000 + .quad 0x3fe879d3f0000000 + .quad 0x3fe894f740000000 + .quad 0x3fe8b00aa0000000 + .quad 0x3fe8cb0e30000000 + .quad 0x3fe8e60200000000 + .quad 0x3fe900e610000000 + .quad 0x3fe91bba80000000 + .quad 0x3fe9367f60000000 + .quad 0x3fe95134d0000000 + .quad 0x3fe96bdad0000000 + .quad 0x3fe9867170000000 + .quad 0x3fe9a0f8d0000000 + .quad 0x3fe9bb70f0000000 + .quad 0x3fe9d5d9f0000000 + .quad 0x3fe9f033e0000000 + .quad 0x3fea0a7ed0000000 + .quad 0x3fea24bad0000000 + .quad 0x3fea3ee7f0000000 + .quad 0x3fea590640000000 + .quad 0x3fea7315d0000000 + .quad 0x3fea8d16b0000000 + .quad 0x3feaa708f0000000 + .quad 0x3feac0eca0000000 + .quad 0x3feadac1e0000000 + .quad 0x3feaf488b0000000 + .quad 0x3feb0e4120000000 + .quad 0x3feb27eb40000000 + .quad 0x3feb418730000000 + .quad 0x3feb5b14f0000000 + .quad 0x3feb749480000000 + .quad 0x3feb8e0620000000 + .quad 0x3feba769b0000000 + .quad 0x3febc0bf50000000 + .quad 0x3febda0710000000 + .quad 0x3febf34110000000 + .quad 0x3fec0c6d40000000 + .quad 0x3fec258bc0000000 + .quad 0x3fec3e9ca0000000 + .quad 0x3fec579fe0000000 + .quad 0x3fec7095a0000000 + .quad 0x3fec897df0000000 + .quad 0x3feca258d0000000 + .quad 0x3fecbb2660000000 + .quad 0x3fecd3e6a0000000 + .quad 0x3fecec9990000000 + .quad 0x3fed053f60000000 + .quad 0x3fed1dd810000000 + .quad 0x3fed3663b0000000 + .quad 0x3fed4ee240000000 + .quad 
0x3fed6753e0000000 + .quad 0x3fed7fb890000000 + .quad 0x3fed981060000000 + .quad 0x3fedb05b60000000 + .quad 0x3fedc899a0000000 + .quad 0x3fede0cb30000000 + .quad 0x3fedf8f020000000 + .quad 0x3fee110860000000 + .quad 0x3fee291420000000 + .quad 0x3fee411360000000 + .quad 0x3fee590630000000 + .quad 0x3fee70eca0000000 + .quad 0x3fee88c6b0000000 + .quad 0x3feea09470000000 + .quad 0x3feeb855f0000000 + .quad 0x3feed00b40000000 + .quad 0x3feee7b470000000 + .quad 0x3feeff5180000000 + .quad 0x3fef16e280000000 + .quad 0x3fef2e6780000000 + .quad 0x3fef45e080000000 + .quad 0x3fef5d4da0000000 + .quad 0x3fef74aef0000000 + .quad 0x3fef8c0460000000 + .quad 0x3fefa34e10000000 + .quad 0x3fefba8c00000000 + .quad 0x3fefd1be40000000 + .quad 0x3fefe8e4f0000000 + .quad 0x3ff0000000000000 + +.align 16 +.L__log_256_tail: + .quad 0x0000000000000000 + .quad 0x3deaf558ee95b37a + .quad 0x3debbc2145fe38de + .quad 0x3dfea5ec312ed069 + .quad 0x3df70b48a629b89f + .quad 0x3e050a1f0cccdd01 + .quad 0x3e044cd04bb60514 + .quad 0x3e01a16898809d2d + .quad 0x3e063bf61cc4d81b + .quad 0x3dfa4a8ca305071d + .quad 0x3e121556bde9f1f0 + .quad 0x3df9929cfd0e6835 + .quad 0x3e2f453f35679ee9 + .quad 0x3e2c26b47913459e + .quad 0x3e2a4fe385b009a2 + .quad 0x3e180ceedb53cb4d + .quad 0x3e2592262cf998a7 + .quad 0x3e1ae28a04f106b8 + .quad 0x3e2c8c66b55ce464 + .quad 0x3e2e690927d688b0 + .quad 0x3de5b5774c7658b4 + .quad 0x3e0adc16d26859c7 + .quad 0x3df7fa5b21cbdb5d + .quad 0x3e2e160149209a68 + .quad 0x3e39b4f3c72c4f78 + .quad 0x3e222418b7fcd690 + .quad 0x3e2d54aded7a9150 + .quad 0x3e360f4c7f1aed15 + .quad 0x3e13c570d0fa8f96 + .quad 0x3e3b3514c7e0166e + .quad 0x3e3307ee9a6271d2 + .quad 0x3dee9722922c0226 + .quad 0x3e33f7ad0f3f4016 + .quad 0x3e3592262cf998a7 + .quad 0x3e23bc09fca70073 + .quad 0x3e2f41777bc5f936 + .quad 0x3dd781d97ee91247 + .quad 0x3e306a56d76b9a84 + .quad 0x3e2df9c37c0beb3a + .quad 0x3e1905c35651c429 + .quad 0x3e3b69d927dfc23d + .quad 0x3e2d7e57a5afb633 + .quad 0x3e3bb29bdc81c4db + .quad 0x3e38ee1b912d8994 + .quad 0x3e3864b2df91e96a + .quad 0x3e1d8a40770df213 + .quad 0x3e2d39a9331f27cf + .quad 0x3e32411e4e8eea54 + .quad 0x3e3204d0144751b3 + .quad 0x3e2268331dd8bd0b + .quad 0x3e47606012de0634 + .quad 0x3e3550aa3a93ec7e + .quad 0x3e45a616eb9612e0 + .quad 0x3e3aec23fd65f8e1 + .quad 0x3e248f838294639c + .quad 0x3e3b62384cafa1a3 + .quad 0x3e461c0e73048b72 + .quad 0x3e36cc9a0d8c0e85 + .quad 0x3e489b355ede26f4 + .quad 0x3e2b5941acd71f1e + .quad 0x3e4d499bd9b32266 + .quad 0x3e043b9f52b061ba + .quad 0x3e46360892eb65e6 + .quad 0x3e4dba9f8729ab41 + .quad 0x3e479a3715fc9257 + .quad 0x3e0d1f6d3f77ae38 + .quad 0x3e48992d66fb9ec1 + .quad 0x3e4666f195620f03 + .quad 0x3e43f7ad0f3f4016 + .quad 0x3e30a522b65bc039 + .quad 0x3e319dee9b9489e3 + .quad 0x3e323352e1a31521 + .quad 0x3e4b3a19bcaf1aa4 + .quad 0x3e3f2f060a50d366 + .quad 0x3e44fdf677c8dfd9 + .quad 0x3e48a35588aec6df + .quad 0x3e28b0e2a19575b0 + .quad 0x3e2ec30c6e3e04a7 + .quad 0x3e2705912d25b325 + .quad 0x3e2dae1b8d59e849 + .quad 0x3e423e2e1169656a + .quad 0x3e349d026e33d675 + .quad 0x3e423c465e6976da + .quad 0x3e366c977e236c73 + .quad 0x3e44fec0a13af881 + .quad 0x3e3bdefbd14a0816 + .quad 0x3e42fe3e91c348e4 + .quad 0x3e4fc0c868ccc02d + .quad 0x3e3ce20a829051bb + .quad 0x3e47f10cf32e6bba + .quad 0x3e43cf2061568859 + .quad 0x3e484995cb804b94 + .quad 0x3e4a52b6acfcfdca + .quad 0x3e3b291ecf4dff1e + .quad 0x3e21d2c3e64ae851 + .quad 0x3e4017e4faa42b7d + .quad 0x3de975077f1f5f0c + .quad 0x3e20327dc8093a52 + .quad 0x3e3108d9313aec65 + .quad 0x3e4a12e5301be44a + .quad 0x3e1e754d20c519e1 + .quad 
0x3e3f456f394f9727 + .quad 0x3e29471103e8f00d + .quad 0x3e3ef3150343f8df + .quad 0x3e41960d9d9c3263 + .quad 0x3e4204d0144751b3 + .quad 0x3e4507ff357398fe + .quad 0x3e4dc9937fc8cafd + .quad 0x3e572f32fe672868 + .quad 0x3e53e49d647d323e + .quad 0x3e33fb81ea92d9e0 + .quad 0x3e43e387ef003635 + .quad 0x3e1ac754cb104aea + .quad 0x3e4535f0444ebaaf + .quad 0x3e253c8ea7b1cdda + .quad 0x3e3cf0c0396a568b + .quad 0x3e5543ca873c2b4a + .quad 0x3e425780181e2b37 + .quad 0x3e5ee52ed49d71d2 + .quad 0x3e51e64842e2c386 + .quad 0x3e5d2ba01bc76a27 + .quad 0x3e5b39774c30f499 + .quad 0x3e38740932120aea + .quad 0x3e576dab3462a1e8 + .quad 0x3e409c9f20203b31 + .quad 0x3e516e7a08ad0d1a + .quad 0x3e46172fe015e13b + .quad 0x3e49e4558147cf67 + .quad 0x3e4cfdeb43cfd005 + .quad 0x3e3a809c03254a71 + .quad 0x3e47acfc98509e33 + .quad 0x3e54366de473e474 + .quad 0x3e5569394d90d724 + .quad 0x3e32b83ec743664c + .quad 0x3e56db22c4808ee5 + .quad 0x3df7ae84940df0e1 + .quad 0x3e554042cd999564 + .quad 0x3e4242b8488b3056 + .quad 0x3e4e7dc059ab8a9e + .quad 0x3e5a71e977d7da5f + .quad 0x3e5d30d552ce0ec3 + .quad 0x3e43208592b6c6b7 + .quad 0x3e51440e7149afff + .quad 0x3e36812c371a1c87 + .quad 0x3e579a3715fc9257 + .quad 0x3e57c92f2af8b0ca + .quad 0x3e56679d8894dbdf + .quad 0x3e2a9f33e77507f0 + .quad 0x3e4c22a3e377a524 + .quad 0x3e3723c84a77a4dc + .quad 0x3e594a871b636194 + .quad 0x3e570d6058f62f4d + .quad 0x3e4a6274cf0e362f + .quad 0x3e42fe3570af1a0b + .quad 0x3e596a286955d67e + .quad 0x3e442104f127091e + .quad 0x3e407826bae32c6b + .quad 0x3df8d8844ce77237 + .quad 0x3e5eaa609080d4b4 + .quad 0x3e4dc66fbe61efc4 + .quad 0x3e5c8f11979a5db6 + .quad 0x3e52dedf0e6f1770 + .quad 0x3e5cb41e1410132a + .quad 0x3e32866d705c553d + .quad 0x3e54ec3293b2fbe0 + .quad 0x3e578b8c2f4d0fe1 + .quad 0x3e562ad8f7ca2cff + .quad 0x3e5a298b5f973a2c + .quad 0x3e49381d4f1b95e0 + .quad 0x3e564c7bdb9bc56c + .quad 0x3e5fbb4caef790fc + .quad 0x3e51200c3f899927 + .quad 0x3e526a05c813d56e + .quad 0x3e4681e2910108ee + .quad 0x3e282cf15d12ecd7 + .quad 0x3e0a537e32446892 + .quad 0x3e46f9c1cb6f7010 + .quad 0x3e4328ddcedf39d8 + .quad 0x3e164f64c210df9d + .quad 0x3e58f676e17cc811 + .quad 0x3e560ddf1680dd45 + .quad 0x3e5e2da951c2d91b + .quad 0x3e5696777b66d115 + .quad 0x3e311eb3043f5601 + .quad 0x3e48000b33f90fd4 + .quad 0x3e523e2e1169656a + .quad 0x3e5b41565d3990cb + .quad 0x3e46138b8d9d31e6 + .quad 0x3e3565afaf7f6248 + .quad 0x3e4b68e0ba153594 + .quad 0x3e3d87027ef4ab9a + .quad 0x3e556b9c99085939 + .quad 0x3e5aa02166cccab2 + .quad 0x3e5991d2aca399a1 + .quad 0x3e54982259cc625d + .quad 0x3e4b9feddaab9820 + .quad 0x3e3c70c0f683cc68 + .quad 0x3e213156425e67e5 + .quad 0x3df79063deab051f + .quad 0x3e27e2744b2b8ca5 + .quad 0x3e4600534df378df + .quad 0x3e59322676507a79 + .quad 0x3e4c4720cb4558b5 + .quad 0x3e445e4b56add63a + .quad 0x3e4af321af5e9bb5 + .quad 0x3e57f1e1148dad64 + .quad 0x3e42a4022f65e2e6 + .quad 0x3e11f2ccbcd0d3cc + .quad 0x3e5eaa65b49696e2 + .quad 0x3e110e6123a74764 + .quad 0x3e3cf24b2077c3f6 + .quad 0x3e4fc8d8164754da + .quad 0x3e598cfcdb6a2dbc + .quad 0x3e24464a6bcdf47b + .quad 0x3e41f1774d8b66a6 + .quad 0x3e459920a2adf6fa + .quad 0x3e370d02a99b4c5a + .quad 0x3e576b6cafa2532d + .quad 0x3e5d23c38ec17936 + .quad 0x3e541b6b1b0e66c4 + .quad 0x3e5952662c6bfdc7 + .quad 0x3e4333f3d6bb35ec + .quad 0x3e195120d8486e92 + .quad 0x3e5db8a405fac56e + .quad 0x3e5a4c112ce6312e + .quad 0x3e536987e1924e45 + .quad 0x3e33f98ea94bc1bd + .quad 0x3e459718aacb6ec7 + .quad 0x3df975077f1f5f0c + .quad 0x3e13654d88f20500 + .quad 0x3e40f598530f101b + .quad 0x3e5145f6c94f7fd7 + .quad 
0x3e567fead8bcce75 + .quad 0x3e52e67148cd0a7b + .quad 0x3e10d5e5897de907 + .quad 0x3e5b5ee92c53d919 + .quad 0x3e5c1c02803f7554 + .quad 0x3e5d5caa7a35c9f7 + .quad 0x3e5910459cac3223 + .quad 0x3e41fbe1bb98afdf + .quad 0x3e3b135395510d1e + .quad 0x3e47b8f0e7b8e757 + .quad 0x3e519511f61a96b8 + .quad 0x3e5117d846ae1f8e + .quad 0x3e2b3a9507d6dc1f + .quad 0x3e15fa7c78c9e676 + .quad 0x3e2db76303b21928 + .quad 0x3e27eb8450ac22ed + .quad 0x3e579e0caa9c9ab7 + .quad 0x3e59de6d7cba1bbe + .quad 0x3e1df5f5baf436cb + .quad 0x3e3e746344728dbf + .quad 0x3e277c23362928b9 + .quad 0x3e4715137cfeba9f + .quad 0x3e58fe55f2856443 + .quad 0x3e25bd1a025d9e24 + .quad 0x0000000000000000 + +.align 16 +.L__log_F_inv: + .quad 0x4000000000000000 + .quad 0x3fffe01fe01fe020 + .quad 0x3fffc07f01fc07f0 + .quad 0x3fffa11caa01fa12 + .quad 0x3fff81f81f81f820 + .quad 0x3fff6310aca0dbb5 + .quad 0x3fff44659e4a4271 + .quad 0x3fff25f644230ab5 + .quad 0x3fff07c1f07c1f08 + .quad 0x3ffee9c7f8458e02 + .quad 0x3ffecc07b301ecc0 + .quad 0x3ffeae807aba01eb + .quad 0x3ffe9131abf0b767 + .quad 0x3ffe741aa59750e4 + .quad 0x3ffe573ac901e574 + .quad 0x3ffe3a9179dc1a73 + .quad 0x3ffe1e1e1e1e1e1e + .quad 0x3ffe01e01e01e01e + .quad 0x3ffde5d6e3f8868a + .quad 0x3ffdca01dca01dca + .quad 0x3ffdae6076b981db + .quad 0x3ffd92f2231e7f8a + .quad 0x3ffd77b654b82c34 + .quad 0x3ffd5cac807572b2 + .quad 0x3ffd41d41d41d41d + .quad 0x3ffd272ca3fc5b1a + .quad 0x3ffd0cb58f6ec074 + .quad 0x3ffcf26e5c44bfc6 + .quad 0x3ffcd85689039b0b + .quad 0x3ffcbe6d9601cbe7 + .quad 0x3ffca4b3055ee191 + .quad 0x3ffc8b265afb8a42 + .quad 0x3ffc71c71c71c71c + .quad 0x3ffc5894d10d4986 + .quad 0x3ffc3f8f01c3f8f0 + .quad 0x3ffc26b5392ea01c + .quad 0x3ffc0e070381c0e0 + .quad 0x3ffbf583ee868d8b + .quad 0x3ffbdd2b899406f7 + .quad 0x3ffbc4fd65883e7b + .quad 0x3ffbacf914c1bad0 + .quad 0x3ffb951e2b18ff23 + .quad 0x3ffb7d6c3dda338b + .quad 0x3ffb65e2e3beee05 + .quad 0x3ffb4e81b4e81b4f + .quad 0x3ffb37484ad806ce + .quad 0x3ffb2036406c80d9 + .quad 0x3ffb094b31d922a4 + .quad 0x3ffaf286bca1af28 + .quad 0x3ffadbe87f94905e + .quad 0x3ffac5701ac5701b + .quad 0x3ffaaf1d2f87ebfd + .quad 0x3ffa98ef606a63be + .quad 0x3ffa82e65130e159 + .quad 0x3ffa6d01a6d01a6d + .quad 0x3ffa574107688a4a + .quad 0x3ffa41a41a41a41a + .quad 0x3ffa2c2a87c51ca0 + .quad 0x3ffa16d3f97a4b02 + .quad 0x3ffa01a01a01a01a + .quad 0x3ff9ec8e951033d9 + .quad 0x3ff9d79f176b682d + .quad 0x3ff9c2d14ee4a102 + .quad 0x3ff9ae24ea5510da + .quad 0x3ff999999999999a + .quad 0x3ff9852f0d8ec0ff + .quad 0x3ff970e4f80cb872 + .quad 0x3ff95cbb0be377ae + .quad 0x3ff948b0fcd6e9e0 + .quad 0x3ff934c67f9b2ce6 + .quad 0x3ff920fb49d0e229 + .quad 0x3ff90d4f120190d5 + .quad 0x3ff8f9c18f9c18fa + .quad 0x3ff8e6527af1373f + .quad 0x3ff8d3018d3018d3 + .quad 0x3ff8bfce8062ff3a + .quad 0x3ff8acb90f6bf3aa + .quad 0x3ff899c0f601899c + .quad 0x3ff886e5f0abb04a + .quad 0x3ff87427bcc092b9 + .quad 0x3ff8618618618618 + .quad 0x3ff84f00c2780614 + .quad 0x3ff83c977ab2bedd + .quad 0x3ff82a4a0182a4a0 + .quad 0x3ff8181818181818 + .quad 0x3ff8060180601806 + .quad 0x3ff7f405fd017f40 + .quad 0x3ff7e225515a4f1d + .quad 0x3ff7d05f417d05f4 + .quad 0x3ff7beb3922e017c + .quad 0x3ff7ad2208e0ecc3 + .quad 0x3ff79baa6bb6398b + .quad 0x3ff78a4c8178a4c8 + .quad 0x3ff77908119ac60d + .quad 0x3ff767dce434a9b1 + .quad 0x3ff756cac201756d + .quad 0x3ff745d1745d1746 + .quad 0x3ff734f0c541fe8d + .quad 0x3ff724287f46debc + .quad 0x3ff713786d9c7c09 + .quad 0x3ff702e05c0b8170 + .quad 0x3ff6f26016f26017 + .quad 0x3ff6e1f76b4337c7 + .quad 0x3ff6d1a62681c861 + .quad 0x3ff6c16c16c16c17 + .quad 
0x3ff6b1490aa31a3d + .quad 0x3ff6a13cd1537290 + .quad 0x3ff691473a88d0c0 + .quad 0x3ff6816816816817 + .quad 0x3ff6719f3601671a + .quad 0x3ff661ec6a5122f9 + .quad 0x3ff6524f853b4aa3 + .quad 0x3ff642c8590b2164 + .quad 0x3ff63356b88ac0de + .quad 0x3ff623fa77016240 + .quad 0x3ff614b36831ae94 + .quad 0x3ff6058160581606 + .quad 0x3ff5f66434292dfc + .quad 0x3ff5e75bb8d015e7 + .quad 0x3ff5d867c3ece2a5 + .quad 0x3ff5c9882b931057 + .quad 0x3ff5babcc647fa91 + .quad 0x3ff5ac056b015ac0 + .quad 0x3ff59d61f123ccaa + .quad 0x3ff58ed2308158ed + .quad 0x3ff5805601580560 + .quad 0x3ff571ed3c506b3a + .quad 0x3ff56397ba7c52e2 + .quad 0x3ff5555555555555 + .quad 0x3ff54725e6bb82fe + .quad 0x3ff5390948f40feb + .quad 0x3ff52aff56a8054b + .quad 0x3ff51d07eae2f815 + .quad 0x3ff50f22e111c4c5 + .quad 0x3ff5015015015015 + .quad 0x3ff4f38f62dd4c9b + .quad 0x3ff4e5e0a72f0539 + .quad 0x3ff4d843bedc2c4c + .quad 0x3ff4cab88725af6e + .quad 0x3ff4bd3edda68fe1 + .quad 0x3ff4afd6a052bf5b + .quad 0x3ff4a27fad76014a + .quad 0x3ff49539e3b2d067 + .quad 0x3ff4880522014880 + .quad 0x3ff47ae147ae147b + .quad 0x3ff46dce34596066 + .quad 0x3ff460cbc7f5cf9a + .quad 0x3ff453d9e2c776ca + .quad 0x3ff446f86562d9fb + .quad 0x3ff43a2730abee4d + .quad 0x3ff42d6625d51f87 + .quad 0x3ff420b5265e5951 + .quad 0x3ff4141414141414 + .quad 0x3ff40782d10e6566 + .quad 0x3ff3fb013fb013fb + .quad 0x3ff3ee8f42a5af07 + .quad 0x3ff3e22cbce4a902 + .quad 0x3ff3d5d991aa75c6 + .quad 0x3ff3c995a47babe7 + .quad 0x3ff3bd60d9232955 + .quad 0x3ff3b13b13b13b14 + .quad 0x3ff3a524387ac822 + .quad 0x3ff3991c2c187f63 + .quad 0x3ff38d22d366088e + .quad 0x3ff3813813813814 + .quad 0x3ff3755bd1c945ee + .quad 0x3ff3698df3de0748 + .quad 0x3ff35dce5f9f2af8 + .quad 0x3ff3521cfb2b78c1 + .quad 0x3ff34679ace01346 + .quad 0x3ff33ae45b57bcb2 + .quad 0x3ff32f5ced6a1dfa + .quad 0x3ff323e34a2b10bf + .quad 0x3ff3187758e9ebb6 + .quad 0x3ff30d190130d190 + .quad 0x3ff301c82ac40260 + .quad 0x3ff2f684bda12f68 + .quad 0x3ff2eb4ea1fed14b + .quad 0x3ff2e025c04b8097 + .quad 0x3ff2d50a012d50a0 + .quad 0x3ff2c9fb4d812ca0 + .quad 0x3ff2bef98e5a3711 + .quad 0x3ff2b404ad012b40 + .quad 0x3ff2a91c92f3c105 + .quad 0x3ff29e4129e4129e + .quad 0x3ff293725bb804a5 + .quad 0x3ff288b01288b013 + .quad 0x3ff27dfa38a1ce4d + .quad 0x3ff27350b8812735 + .quad 0x3ff268b37cd60127 + .quad 0x3ff25e22708092f1 + .quad 0x3ff2539d7e9177b2 + .quad 0x3ff2492492492492 + .quad 0x3ff23eb79717605b + .quad 0x3ff23456789abcdf + .quad 0x3ff22a0122a0122a + .quad 0x3ff21fb78121fb78 + .quad 0x3ff21579804855e6 + .quad 0x3ff20b470c67c0d9 + .quad 0x3ff2012012012012 + .quad 0x3ff1f7047dc11f70 + .quad 0x3ff1ecf43c7fb84c + .quad 0x3ff1e2ef3b3fb874 + .quad 0x3ff1d8f5672e4abd + .quad 0x3ff1cf06ada2811d + .quad 0x3ff1c522fc1ce059 + .quad 0x3ff1bb4a4046ed29 + .quad 0x3ff1b17c67f2bae3 + .quad 0x3ff1a7b9611a7b96 + .quad 0x3ff19e0119e0119e + .quad 0x3ff19453808ca29c + .quad 0x3ff18ab083902bdb + .quad 0x3ff1811811811812 + .quad 0x3ff1778a191bd684 + .quad 0x3ff16e0689427379 + .quad 0x3ff1648d50fc3201 + .quad 0x3ff15b1e5f75270d + .quad 0x3ff151b9a3fdd5c9 + .quad 0x3ff1485f0e0acd3b + .quad 0x3ff13f0e8d344724 + .quad 0x3ff135c81135c811 + .quad 0x3ff12c8b89edc0ac + .quad 0x3ff12358e75d3033 + .quad 0x3ff11a3019a74826 + .quad 0x3ff1111111111111 + .quad 0x3ff107fbbe011080 + .quad 0x3ff0fef010fef011 + .quad 0x3ff0f5edfab325a2 + .quad 0x3ff0ecf56be69c90 + .quad 0x3ff0e40655826011 + .quad 0x3ff0db20a88f4696 + .quad 0x3ff0d24456359e3a + .quad 0x3ff0c9714fbcda3b + .quad 0x3ff0c0a7868b4171 + .quad 0x3ff0b7e6ec259dc8 + .quad 0x3ff0af2f722eecb5 + .quad 
0x3ff0a6810a6810a7 + .quad 0x3ff09ddba6af8360 + .quad 0x3ff0953f39010954 + .quad 0x3ff08cabb37565e2 + .quad 0x3ff0842108421084 + .quad 0x3ff07b9f29b8eae2 + .quad 0x3ff073260a47f7c6 + .quad 0x3ff06ab59c7912fb + .quad 0x3ff0624dd2f1a9fc + .quad 0x3ff059eea0727586 + .quad 0x3ff05197f7d73404 + .quad 0x3ff04949cc1664c5 + .quad 0x3ff0410410410410 + .quad 0x3ff038c6b78247fc + .quad 0x3ff03091b51f5e1a + .quad 0x3ff02864fc7729e9 + .quad 0x3ff0204081020408 + .quad 0x3ff0182436517a37 + .quad 0x3ff0101010101010 + .quad 0x3ff0080402010080 + .quad 0x3ff0000000000000 + .quad 0x0000000000000000 + +
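The tables that close the file above follow one recipe: .L__log_F_inv holds the reciprocals 1/F_j of the 257 breakpoints F_j = 0.5 + j/512, and each lead/tail pair splits log2(2*F_j) = log2((256+j)/256) into a high part truncated to its top 36 bits plus the rounding remainder (the last lead entry, 0x3ff0000000000000 = 1.0 = log2(2), suggests a log2 table; the 36-bit cut is inferred from the trailing-zero pattern of the lead entries). A hypothetical generator sketch, not part of this patch, that reproduces the values:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h>

/* print candidate .L__log_F_inv / _lead / _tail entries for j = 0..256 */
static uint64_t bits64(double d) { uint64_t u; memcpy(&u, &d, sizeof u); return u; }

int main(void) {
    for (int j = 0; j <= 256; j++) {
        double F    = 0.5 + j / 512.0;          /* breakpoint in [0.5, 1.0] */
        double Finv = 1.0 / F;                  /* -> .L__log_F_inv[j]      */
        double lg   = log2(2.0 * F);            /* log2((256 + j) / 256)    */
        uint64_t lead = bits64(lg) & 0xfffffffff0000000ULL; /* top 36 bits  */
        double dlead; memcpy(&dlead, &lead, sizeof dlead);
        printf("%3d inv=0x%016" PRIx64 " lead=0x%016" PRIx64 " tail=%a\n",
               j, bits64(Finv), lead, lg - dlead);
    }
    return 0;
}

As a spot check, j = 1 gives Finv = 512/257, whose bit pattern 0x3fffe01fe01fe020 is the second .L__log_F_inv entry above; pow.S later in this patch builds its natural-log tables along the same lines.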
diff --git a/src/gas/log2f.S b/src/gas/log2f.S new file mode 100644 index 0000000..5361e0f --- /dev/null +++ b/src/gas/log2f.S
@@ -0,0 +1,738 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# +# log2f.S +# +# An implementation of the log2f libm function. +# +# Prototype: +# +# float log2f(float x); +# + +# +# Algorithm: +# Similar to the one presented in log.S +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(log2f) +#define fname_special _log2f_special@PLT + + +# local variable storage offsets +.equ p_temp, 0x0 +.equ stack_size, 0x18 + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + sub $stack_size, %rsp + + # compute exponent part + xor %eax, %eax + movdqa %xmm0, %xmm3 + movss %xmm0, %xmm4 + psrld $23, %xmm3 + movd %xmm0, %eax + psubd .L__mask_127(%rip), %xmm3 + movdqa %xmm0, %xmm2 + cvtdq2ps %xmm3, %xmm5 # xexp + + # NaN or inf + movdqa %xmm0, %xmm1 + andps .L__real_inf(%rip), %xmm1 + comiss .L__real_inf(%rip), %xmm1 + je .L__x_is_inf_or_nan + + # check for negative numbers or zero + xorps %xmm1, %xmm1 + comiss %xmm1, %xmm0 + jbe .L__x_is_zero_or_neg + + pand .L__real_mant(%rip), %xmm2 + subss .L__real_one(%rip), %xmm4 + + comiss .L__real_neg127(%rip), %xmm5 + je .L__denormal_adjust + +.L__continue_common: + + # compute the index into the log tables + mov %eax, %r9d + and .L__mask_mant_all7(%rip), %eax + and .L__mask_mant8(%rip), %r9d + shl $1, %r9d + add %r9d, %eax + mov %eax, p_temp(%rsp) + + # near one codepath + andps .L__real_notsign(%rip), %xmm4 + comiss .L__real_threshold(%rip), %xmm4 + jb .L__near_one + + # F, Y + movss p_temp(%rsp), %xmm1 + shr $16, %eax + por .L__real_half(%rip), %xmm2 + por .L__real_half(%rip), %xmm1 + lea .L__log_F_inv(%rip), %r9 + + # f = F - Y, r = f * inv + subss %xmm2, %xmm1 + mulss (%r9,%rax,4), %xmm1 + + movss %xmm1, %xmm2 + movss %xmm1, %xmm0 + + # poly + mulss .L__real_1_over_3(%rip), %xmm2 + mulss %xmm1, %xmm0 + addss .L__real_1_over_2(%rip), %xmm2 + + lea .L__log_128_tail(%rip), %r9 + lea .L__log_128_lead(%rip), %r10 + + mulss %xmm0, %xmm2 + movss (%r9,%rax,4), %xmm3 + addss %xmm2, %xmm1 + + mulss .L__real_log2_e(%rip), %xmm1 + + # m + log2(G) - poly*log2_e + subss %xmm1, %xmm3 + movss %xmm3, %xmm0 + addss (%r10,%rax,4), %xmm5 + addss %xmm5, %xmm0 + + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__near_one: + # r = x - 1.0 + movss .L__real_two(%rip), %xmm2 + subss .L__real_one(%rip), %xmm0 + + # u = r / (2.0 + r) + addss %xmm0, %xmm2 + movss %xmm0, %xmm1 + divss %xmm2, %xmm1 # u + + # correction = r * u + movss %xmm0, %xmm4 + mulss %xmm1, %xmm4 + + # u = u + u + addss %xmm1, %xmm1 + movss %xmm1, %xmm2 + mulss %xmm2, %xmm2 # v = u^2 + + # r2 = (u * v * (ca_1 + v * ca_2) - correction) + movss %xmm1, %xmm3 + mulss %xmm2, %xmm3 # u^3 + mulss .L__real_ca2(%rip), %xmm2 # Bu^2 + addss .L__real_ca1(%rip), %xmm2 # +A + mulss %xmm3, %xmm2 + subss 
%xmm4, %xmm2 # -correction + + movdqa %xmm0, %xmm5 + pand .L__mask_lower(%rip), %xmm5 + subss %xmm5, %xmm0 + addss %xmm0, %xmm2 + + movss %xmm5, %xmm0 + movss %xmm2, %xmm1 + + mulss .L__real_log2_e_tail(%rip), %xmm2 + mulss .L__real_log2_e_tail(%rip), %xmm0 + mulss .L__real_log2_e_lead(%rip), %xmm1 + mulss .L__real_log2_e_lead(%rip), %xmm5 + + addss %xmm2, %xmm0 + addss %xmm1, %xmm0 + addss %xmm5, %xmm0 + + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__denormal_adjust: + por .L__real_one(%rip), %xmm2 + subss .L__real_one(%rip), %xmm2 + movdqa %xmm2, %xmm5 + pand .L__real_mant(%rip), %xmm2 + movd %xmm2, %eax + psrld $23, %xmm5 + psubd .L__mask_253(%rip), %xmm5 + cvtdq2ps %xmm5, %xmm5 + jmp .L__continue_common + +.p2align 4,,15 +.L__x_is_zero_or_neg: + jne .L__x_is_neg + + movss .L__real_ninf(%rip), %xmm1 + mov .L__flag_x_zero(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_neg: + + movss .L__real_nan(%rip), %xmm1 + mov .L__flag_x_neg(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_inf_or_nan: + + cmp .L__real_inf(%rip), %eax + je .L__finish + + cmp .L__real_ninf(%rip), %eax + je .L__x_is_neg + + mov .L__real_qnanbit(%rip), %r9d + and %eax, %r9d + jnz .L__finish + + or .L__real_qnanbit(%rip), %eax + movd %eax, %xmm1 + mov .L__flag_x_nan(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__finish: + add $stack_size, %rsp + ret + + +.data + +.align 16 + +# these codes and the ones in the corresponding .c file have to match +.L__flag_x_zero: .long 00000001 +.L__flag_x_neg: .long 00000002 +.L__flag_x_nan: .long 00000003 + +.align 16 + +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 2.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 +.L__real_neg_qnan: .quad 0x0ffc00000ffc00000 + .quad 0x0ffc00000ffc00000 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits + .quad 0x0007FFFFF007FFFFF +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f + +.L__mask_mant_all7: .quad 0x00000000007f0000 + .quad 0x00000000007f0000 +.L__mask_mant8: .quad 0x0000000000008000 + .quad 0x0000000000008000 + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD + +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 + +.L__real_log2_e_lead: .quad 0x03FB800003FB80000 # 1.4375000000 + .quad 0x03FB800003FB80000 +.L__real_log2_e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633 + .quad 0x03BAA3B293BAA3B29 + +.L__real_log2_e: .quad 0x3fb8aa3b3fb8aa3b + .quad 0x0000000000000000 + +.L__mask_lower: .quad 0x0ffff0000ffff0000 + .quad 0x0ffff0000ffff0000 + +.align 16 + 
+.L__real_neg127: .long 0x0c2fe0000 + .long 0 + .quad 0 + +.L__mask_253: .long 0x000000fd + .long 0 + .quad 0 + +.L__real_threshold: .long 0x3d800000 + .long 0 + .quad 0 + +.L__mask_01: .long 0x00000001 + .long 0 + .quad 0 + +.L__mask_80: .long 0x00000080 + .long 0 + .quad 0 + +.L__real_3b800000: .long 0x3b800000 + .long 0 + .quad 0 + +.L__real_1_over_3: .long 0x3eaaaaab + .long 0 + .quad 0 + +.L__real_1_over_2: .long 0x3f000000 + .long 0 + .quad 0 + +.align 16 +.L__log_128_lead: + .long 0x00000000 + .long 0x3c37c000 + .long 0x3cb70000 + .long 0x3d08c000 + .long 0x3d35c000 + .long 0x3d624000 + .long 0x3d874000 + .long 0x3d9d4000 + .long 0x3db30000 + .long 0x3dc8c000 + .long 0x3dde4000 + .long 0x3df38000 + .long 0x3e044000 + .long 0x3e0ec000 + .long 0x3e194000 + .long 0x3e238000 + .long 0x3e2e0000 + .long 0x3e380000 + .long 0x3e424000 + .long 0x3e4c4000 + .long 0x3e564000 + .long 0x3e604000 + .long 0x3e6a4000 + .long 0x3e740000 + .long 0x3e7dc000 + .long 0x3e83c000 + .long 0x3e888000 + .long 0x3e8d4000 + .long 0x3e920000 + .long 0x3e96c000 + .long 0x3e9b8000 + .long 0x3ea00000 + .long 0x3ea4c000 + .long 0x3ea94000 + .long 0x3eae0000 + .long 0x3eb28000 + .long 0x3eb70000 + .long 0x3ebb8000 + .long 0x3ec00000 + .long 0x3ec44000 + .long 0x3ec8c000 + .long 0x3ecd4000 + .long 0x3ed18000 + .long 0x3ed5c000 + .long 0x3eda0000 + .long 0x3ede8000 + .long 0x3ee2c000 + .long 0x3ee70000 + .long 0x3eeb0000 + .long 0x3eef4000 + .long 0x3ef38000 + .long 0x3ef78000 + .long 0x3efbc000 + .long 0x3effc000 + .long 0x3f01c000 + .long 0x3f040000 + .long 0x3f060000 + .long 0x3f080000 + .long 0x3f0a0000 + .long 0x3f0c0000 + .long 0x3f0dc000 + .long 0x3f0fc000 + .long 0x3f11c000 + .long 0x3f13c000 + .long 0x3f15c000 + .long 0x3f178000 + .long 0x3f198000 + .long 0x3f1b4000 + .long 0x3f1d4000 + .long 0x3f1f0000 + .long 0x3f210000 + .long 0x3f22c000 + .long 0x3f24c000 + .long 0x3f268000 + .long 0x3f288000 + .long 0x3f2a4000 + .long 0x3f2c0000 + .long 0x3f2dc000 + .long 0x3f2f8000 + .long 0x3f318000 + .long 0x3f334000 + .long 0x3f350000 + .long 0x3f36c000 + .long 0x3f388000 + .long 0x3f3a4000 + .long 0x3f3c0000 + .long 0x3f3dc000 + .long 0x3f3f8000 + .long 0x3f414000 + .long 0x3f42c000 + .long 0x3f448000 + .long 0x3f464000 + .long 0x3f480000 + .long 0x3f498000 + .long 0x3f4b4000 + .long 0x3f4d0000 + .long 0x3f4e8000 + .long 0x3f504000 + .long 0x3f51c000 + .long 0x3f538000 + .long 0x3f550000 + .long 0x3f56c000 + .long 0x3f584000 + .long 0x3f5a0000 + .long 0x3f5b8000 + .long 0x3f5d0000 + .long 0x3f5ec000 + .long 0x3f604000 + .long 0x3f61c000 + .long 0x3f638000 + .long 0x3f650000 + .long 0x3f668000 + .long 0x3f680000 + .long 0x3f698000 + .long 0x3f6b0000 + .long 0x3f6cc000 + .long 0x3f6e4000 + .long 0x3f6fc000 + .long 0x3f714000 + .long 0x3f72c000 + .long 0x3f744000 + .long 0x3f75c000 + .long 0x3f770000 + .long 0x3f788000 + .long 0x3f7a0000 + .long 0x3f7b8000 + .long 0x3f7d0000 + .long 0x3f7e8000 + .long 0x3f800000 + +.align 16 +.L__log_128_tail: + .long 0x00000000 + .long 0x374a16dd + .long 0x37f2d0b8 + .long 0x381a3aa2 + .long 0x37b4dd63 + .long 0x383f5721 + .long 0x384e27e8 + .long 0x380bf749 + .long 0x387dbeb2 + .long 0x37216e46 + .long 0x3684815b + .long 0x383b045f + .long 0x390b119b + .long 0x391a32ea + .long 0x38ba789e + .long 0x39553f30 + .long 0x3651cfde + .long 0x39685a9d + .long 0x39057a05 + .long 0x395ba0ef + .long 0x396bc5b6 + .long 0x3936d9bb + .long 0x38772619 + .long 0x39017ce9 + .long 0x3902d720 + .long 0x38856dd8 + .long 0x3941f6b4 + .long 0x3980b652 + .long 0x3980f561 + .long 0x39443f13 + .long 
0x38926752 + .long 0x39c8c763 + .long 0x391e12f3 + .long 0x39b7bf89 + .long 0x36d1cfde + .long 0x38c7f233 + .long 0x39087367 + .long 0x38e95d3f + .long 0x38256316 + .long 0x39d38e5c + .long 0x396ea247 + .long 0x350e4788 + .long 0x395d829f + .long 0x39c30f2f + .long 0x39fd7ee7 + .long 0x3872e9e7 + .long 0x3897d694 + .long 0x3824923a + .long 0x39ea7c06 + .long 0x39a7fa88 + .long 0x391aa879 + .long 0x39dace65 + .long 0x39215a32 + .long 0x39af3350 + .long 0x3a7b5172 + .long 0x389cf27f + .long 0x3902806b + .long 0x3909d8a9 + .long 0x38c9faa1 + .long 0x37a33dca + .long 0x3a6623d2 + .long 0x3a3c7a61 + .long 0x3a083a84 + .long 0x39930161 + .long 0x35d1cfde + .long 0x3a2d0ebd + .long 0x399f1aad + .long 0x3a67ff6d + .long 0x39ecfea8 + .long 0x3a7b26f3 + .long 0x39ec1fa6 + .long 0x3a675314 + .long 0x399e12f3 + .long 0x3a2d4b66 + .long 0x370c3845 + .long 0x399ba329 + .long 0x3a1044d3 + .long 0x3a49a196 + .long 0x3a79fe83 + .long 0x3905c7aa + .long 0x39802391 + .long 0x39abe796 + .long 0x39c65a9d + .long 0x39cfa6c5 + .long 0x39c7f593 + .long 0x39af6ff7 + .long 0x39863e4d + .long 0x391910c1 + .long 0x369d5be7 + .long 0x3a541616 + .long 0x3a1ee960 + .long 0x39c38ed2 + .long 0x38e61600 + .long 0x3a4fedb4 + .long 0x39f6b4ab + .long 0x38f8d3b0 + .long 0x3a3b3faa + .long 0x399fb693 + .long 0x3a5cfe71 + .long 0x39c5740b + .long 0x3a611eb0 + .long 0x39b079c4 + .long 0x3a4824d7 + .long 0x39439a54 + .long 0x3a1291ea + .long 0x3a6d3673 + .long 0x3981c731 + .long 0x3a0da88f + .long 0x3a53945c + .long 0x3895ae91 + .long 0x3996372a + .long 0x39f9a832 + .long 0x3a27eda4 + .long 0x3a4c764f + .long 0x3a6a7c06 + .long 0x370321eb + .long 0x3899ab3f + .long 0x38f02086 + .long 0x390a1707 + .long 0x39031e44 + .long 0x38c6b362 + .long 0x382bf195 + .long 0x3a768e36 + .long 0x3a5c503b + .long 0x3a3c1179 + .long 0x3a15de1d + .long 0x39d3845d + .long 0x395f263f + .long 0x00000000 + +.align 16 +.L__log_F_inv: + .long 0x40000000 + .long 0x3ffe03f8 + .long 0x3ffc0fc1 + .long 0x3ffa232d + .long 0x3ff83e10 + .long 0x3ff6603e + .long 0x3ff4898d + .long 0x3ff2b9d6 + .long 0x3ff0f0f1 + .long 0x3fef2eb7 + .long 0x3fed7304 + .long 0x3febbdb3 + .long 0x3fea0ea1 + .long 0x3fe865ac + .long 0x3fe6c2b4 + .long 0x3fe52598 + .long 0x3fe38e39 + .long 0x3fe1fc78 + .long 0x3fe07038 + .long 0x3fdee95c + .long 0x3fdd67c9 + .long 0x3fdbeb62 + .long 0x3fda740e + .long 0x3fd901b2 + .long 0x3fd79436 + .long 0x3fd62b81 + .long 0x3fd4c77b + .long 0x3fd3680d + .long 0x3fd20d21 + .long 0x3fd0b6a0 + .long 0x3fcf6475 + .long 0x3fce168a + .long 0x3fcccccd + .long 0x3fcb8728 + .long 0x3fca4588 + .long 0x3fc907da + .long 0x3fc7ce0c + .long 0x3fc6980c + .long 0x3fc565c8 + .long 0x3fc43730 + .long 0x3fc30c31 + .long 0x3fc1e4bc + .long 0x3fc0c0c1 + .long 0x3fbfa030 + .long 0x3fbe82fa + .long 0x3fbd6910 + .long 0x3fbc5264 + .long 0x3fbb3ee7 + .long 0x3fba2e8c + .long 0x3fb92144 + .long 0x3fb81703 + .long 0x3fb70fbb + .long 0x3fb60b61 + .long 0x3fb509e7 + .long 0x3fb40b41 + .long 0x3fb30f63 + .long 0x3fb21643 + .long 0x3fb11fd4 + .long 0x3fb02c0b + .long 0x3faf3ade + .long 0x3fae4c41 + .long 0x3fad602b + .long 0x3fac7692 + .long 0x3fab8f6a + .long 0x3faaaaab + .long 0x3fa9c84a + .long 0x3fa8e83f + .long 0x3fa80a81 + .long 0x3fa72f05 + .long 0x3fa655c4 + .long 0x3fa57eb5 + .long 0x3fa4a9cf + .long 0x3fa3d70a + .long 0x3fa3065e + .long 0x3fa237c3 + .long 0x3fa16b31 + .long 0x3fa0a0a1 + .long 0x3f9fd80a + .long 0x3f9f1166 + .long 0x3f9e4cad + .long 0x3f9d89d9 + .long 0x3f9cc8e1 + .long 0x3f9c09c1 + .long 0x3f9b4c70 + .long 0x3f9a90e8 + .long 0x3f99d723 + .long 0x3f991f1a 
+ .long 0x3f9868c8 + .long 0x3f97b426 + .long 0x3f97012e + .long 0x3f964fda + .long 0x3f95a025 + .long 0x3f94f209 + .long 0x3f944581 + .long 0x3f939a86 + .long 0x3f92f114 + .long 0x3f924925 + .long 0x3f91a2b4 + .long 0x3f90fdbc + .long 0x3f905a38 + .long 0x3f8fb824 + .long 0x3f8f177a + .long 0x3f8e7835 + .long 0x3f8dda52 + .long 0x3f8d3dcb + .long 0x3f8ca29c + .long 0x3f8c08c1 + .long 0x3f8b7034 + .long 0x3f8ad8f3 + .long 0x3f8a42f8 + .long 0x3f89ae41 + .long 0x3f891ac7 + .long 0x3f888889 + .long 0x3f87f781 + .long 0x3f8767ab + .long 0x3f86d905 + .long 0x3f864b8a + .long 0x3f85bf37 + .long 0x3f853408 + .long 0x3f84a9fa + .long 0x3f842108 + .long 0x3f839930 + .long 0x3f83126f + .long 0x3f828cc0 + .long 0x3f820821 + .long 0x3f81848e + .long 0x3f810204 + .long 0x3f808081 + .long 0x3f800000 + +
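For reference, the fast path in log2f.S above works on x = 2^m * Y with Y scaled into [0.5, 1): the top seven mantissa bits plus a rounding bit select a breakpoint F = (128+j)/256, the table supplies 1/F, and log2(x) = m + log2(2F) - log2(F/Y), where log2(F/Y) comes from a short polynomial in r = (F - Y)/F. A rough, self-contained C model (hypothetical names, not the shipped code; a live log2f() call stands in for the .L__log_128_lead/_tail tables; assumes a positive normal input away from 1.0):

#include <stdint.h>
#include <string.h>
#include <math.h>

float log2f_model(float x) {
    uint32_t u; memcpy(&u, &x, sizeof u);
    float m = (float)((int)(u >> 23) - 127);             /* xexp              */
    /* index: top 7 mantissa bits (mask_mant_all7), rounded by bit 8 (mask_mant8) */
    uint32_t j = ((u & 0x7f0000u) + ((u & 0x8000u) << 1)) >> 16;  /* 0..128   */
    uint32_t yb = (u & 0x7fffffu) | 0x3f000000u;         /* por real_half     */
    float Y; memcpy(&Y, &yb, sizeof Y);                  /* Y in [0.5, 1)     */
    float F = 0.5f + j / 256.0f;                         /* breakpoint near Y */
    float r = (F - Y) * (1.0f / F);                      /* f * .L__log_F_inv[j] */
    float poly = r + r * r * (0.5f + r * (1.0f / 3.0f)); /* -ln(1-r) = ln(F/Y) */
    return (m + log2f(2.0f * F)) - poly * 1.44269504f;   /* lead/tail, log2_e */
}

logf.S below follows the same skeleton, but keeps the polynomial in natural log and multiplies m by a lead/tail split of ln(2) instead of scaling by log2(e).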
diff --git a/src/gas/logf.S b/src/gas/logf.S new file mode 100644 index 0000000..4cee0b0 --- /dev/null +++ b/src/gas/logf.S
@@ -0,0 +1,725 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# +# logf.S +# +# An implementation of the logf libm function. +# +# Prototype: +# +# float logf(float x); +# + +# +# Algorithm: +# Similar to the one presented in log.S +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(logf) +#define fname_special _logf_special@PLT + + +# local variable storage offsets +.equ p_temp, 0x0 +.equ stack_size, 0x18 + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + sub $stack_size, %rsp + + # compute exponent part + xor %eax, %eax + movdqa %xmm0, %xmm3 + movss %xmm0, %xmm4 + psrld $23, %xmm3 + movd %xmm0, %eax + psubd .L__mask_127(%rip), %xmm3 + movdqa %xmm0, %xmm2 + cvtdq2ps %xmm3, %xmm5 # xexp + + # NaN or inf + movdqa %xmm0, %xmm1 + andps .L__real_inf(%rip), %xmm1 + comiss .L__real_inf(%rip), %xmm1 + je .L__x_is_inf_or_nan + + # check for negative numbers or zero + xorps %xmm1, %xmm1 + comiss %xmm1, %xmm0 + jbe .L__x_is_zero_or_neg + + pand .L__real_mant(%rip), %xmm2 + subss .L__real_one(%rip), %xmm4 + + comiss .L__real_neg127(%rip), %xmm5 + je .L__denormal_adjust + +.L__continue_common: + + # compute the index into the log tables + mov %eax, %r9d + and .L__mask_mant_all7(%rip), %eax + and .L__mask_mant8(%rip), %r9d + shl $1, %r9d + add %r9d, %eax + mov %eax, p_temp(%rsp) + + # check e as a special case + comiss .L__real_ef(%rip), %xmm0 + je .L__logf_e + + # near one codepath + andps .L__real_notsign(%rip), %xmm4 + comiss .L__real_threshold(%rip), %xmm4 + jb .L__near_one + + # F, Y + movss p_temp(%rsp), %xmm1 + shr $16, %eax + por .L__real_half(%rip), %xmm2 + por .L__real_half(%rip), %xmm1 + lea .L__log_F_inv(%rip), %r9 + + # f = F - Y, r = f * inv + subss %xmm2, %xmm1 + mulss (%r9,%rax,4), %xmm1 + + movss %xmm1, %xmm2 + movss %xmm1, %xmm0 + + # poly + mulss .L__real_1_over_3(%rip), %xmm2 + mulss %xmm1, %xmm0 + addss .L__real_1_over_2(%rip), %xmm2 + movss .L__real_log2_tail(%rip), %xmm3 + + lea .L__log_128_tail(%rip), %r9 + lea .L__log_128_lead(%rip), %r10 + + mulss %xmm0, %xmm2 + mulss %xmm5, %xmm3 + addss %xmm2, %xmm1 + + # m*log(2) + log(G) - poly + movss .L__real_log2_lead(%rip), %xmm0 + subss %xmm1, %xmm3 # z2 + mulss %xmm5, %xmm0 + addss (%r9,%rax,4), %xmm3 # z2 + addss (%r10,%rax,4), %xmm0 # z1 + + addss %xmm3, %xmm0 + + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__logf_e: + movss .L__real_one(%rip), %xmm0 + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__near_one: + # r = x - 1.0 + movss .L__real_two(%rip), %xmm2 + subss .L__real_one(%rip), %xmm0 + + # u = r / (2.0 + r) + addss %xmm0, %xmm2 + movss %xmm0, %xmm1 + divss %xmm2, %xmm1 # u + + # correction = r * u + movss %xmm0, %xmm4 + mulss %xmm1, %xmm4 + + # u = u + u + addss %xmm1, %xmm1 + movss 
%xmm1, %xmm2 + mulss %xmm2, %xmm2 # v = u^2 + + # r2 = (u * v * (ca_1 + v * ca_2) - correction) + movss %xmm1, %xmm3 + mulss %xmm2, %xmm3 # u^3 + mulss .L__real_ca2(%rip), %xmm2 # Bu^2 + addss .L__real_ca1(%rip), %xmm2 # +A + mulss %xmm3, %xmm2 + subss %xmm4, %xmm2 # -correction + + # r + r2 + addss %xmm2, %xmm0 + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__denormal_adjust: + por .L__real_one(%rip), %xmm2 + subss .L__real_one(%rip), %xmm2 + movdqa %xmm2, %xmm5 + pand .L__real_mant(%rip), %xmm2 + movd %xmm2, %eax + psrld $23, %xmm5 + psubd .L__mask_253(%rip), %xmm5 + cvtdq2ps %xmm5, %xmm5 + jmp .L__continue_common + +.p2align 4,,15 +.L__x_is_zero_or_neg: + jne .L__x_is_neg + + movss .L__real_ninf(%rip), %xmm1 + mov .L__flag_x_zero(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_neg: + + movss .L__real_nan(%rip), %xmm1 + mov .L__flag_x_neg(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__x_is_inf_or_nan: + + cmp .L__real_inf(%rip), %eax + je .L__finish + + cmp .L__real_ninf(%rip), %eax + je .L__x_is_neg + + mov .L__real_qnanbit(%rip), %r9d + and %eax, %r9d + jnz .L__finish + + or .L__real_qnanbit(%rip), %eax + movd %eax, %xmm1 + mov .L__flag_x_nan(%rip), %edi + call fname_special + jmp .L__finish + +.p2align 4,,15 +.L__finish: + add $stack_size, %rsp + ret + + +.data + +.align 16 + +# these codes and the ones in the corresponding .c file have to match +.L__flag_x_zero: .long 00000001 +.L__flag_x_neg: .long 00000002 +.L__flag_x_nan: .long 00000003 + +.align 16 + +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 2.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 +.L__real_neg_qnan: .quad 0x0ffc00000ffc00000 + .quad 0x0ffc00000ffc00000 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits + .quad 0x0007FFFFF007FFFFF +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f + +.L__mask_mant_all7: .quad 0x00000000007f0000 + .quad 0x00000000007f0000 +.L__mask_mant8: .quad 0x0000000000008000 + .quad 0x0000000000008000 + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD + +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 + + +.align 16 + +.L__real_neg127: .long 0x0c2fe0000 + .long 0 + .quad 0 + +.L__mask_253: .long 0x000000fd + .long 0 + .quad 0 + +.L__real_threshold: .long 0x3d800000 + .long 0 + .quad 0 + +.L__mask_01: .long 0x00000001 + .long 0 + .quad 0 + +.L__mask_80: .long 0x00000080 + .long 0 + .quad 0 + +.L__real_3b800000: .long 0x3b800000 + .long 0 + .quad 0 + +.L__real_1_over_3: .long 0x3eaaaaab + .long 0 + .quad 0 + +.L__real_1_over_2: .long 0x3f000000 + .long 0 + 
.quad 0 + +.align 16 +.L__log_128_lead: + .long 0x00000000 + .long 0x3bff0000 + .long 0x3c7e0000 + .long 0x3cbdc000 + .long 0x3cfc1000 + .long 0x3d1cf000 + .long 0x3d3ba000 + .long 0x3d5a1000 + .long 0x3d785000 + .long 0x3d8b2000 + .long 0x3d9a0000 + .long 0x3da8d000 + .long 0x3db78000 + .long 0x3dc61000 + .long 0x3dd49000 + .long 0x3de2f000 + .long 0x3df13000 + .long 0x3dff6000 + .long 0x3e06b000 + .long 0x3e0db000 + .long 0x3e14a000 + .long 0x3e1b8000 + .long 0x3e226000 + .long 0x3e293000 + .long 0x3e2ff000 + .long 0x3e36b000 + .long 0x3e3d5000 + .long 0x3e43f000 + .long 0x3e4a9000 + .long 0x3e511000 + .long 0x3e579000 + .long 0x3e5e1000 + .long 0x3e647000 + .long 0x3e6ae000 + .long 0x3e713000 + .long 0x3e778000 + .long 0x3e7dc000 + .long 0x3e820000 + .long 0x3e851000 + .long 0x3e882000 + .long 0x3e8b3000 + .long 0x3e8e4000 + .long 0x3e914000 + .long 0x3e944000 + .long 0x3e974000 + .long 0x3e9a3000 + .long 0x3e9d3000 + .long 0x3ea02000 + .long 0x3ea30000 + .long 0x3ea5f000 + .long 0x3ea8d000 + .long 0x3eabb000 + .long 0x3eae8000 + .long 0x3eb16000 + .long 0x3eb43000 + .long 0x3eb70000 + .long 0x3eb9c000 + .long 0x3ebc9000 + .long 0x3ebf5000 + .long 0x3ec21000 + .long 0x3ec4d000 + .long 0x3ec78000 + .long 0x3eca3000 + .long 0x3ecce000 + .long 0x3ecf9000 + .long 0x3ed24000 + .long 0x3ed4e000 + .long 0x3ed78000 + .long 0x3eda2000 + .long 0x3edcc000 + .long 0x3edf5000 + .long 0x3ee1e000 + .long 0x3ee47000 + .long 0x3ee70000 + .long 0x3ee99000 + .long 0x3eec1000 + .long 0x3eeea000 + .long 0x3ef12000 + .long 0x3ef3a000 + .long 0x3ef61000 + .long 0x3ef89000 + .long 0x3efb0000 + .long 0x3efd7000 + .long 0x3effe000 + .long 0x3f012000 + .long 0x3f025000 + .long 0x3f039000 + .long 0x3f04c000 + .long 0x3f05f000 + .long 0x3f072000 + .long 0x3f084000 + .long 0x3f097000 + .long 0x3f0aa000 + .long 0x3f0bc000 + .long 0x3f0cf000 + .long 0x3f0e1000 + .long 0x3f0f4000 + .long 0x3f106000 + .long 0x3f118000 + .long 0x3f12a000 + .long 0x3f13c000 + .long 0x3f14e000 + .long 0x3f160000 + .long 0x3f172000 + .long 0x3f183000 + .long 0x3f195000 + .long 0x3f1a7000 + .long 0x3f1b8000 + .long 0x3f1c9000 + .long 0x3f1db000 + .long 0x3f1ec000 + .long 0x3f1fd000 + .long 0x3f20e000 + .long 0x3f21f000 + .long 0x3f230000 + .long 0x3f241000 + .long 0x3f252000 + .long 0x3f263000 + .long 0x3f273000 + .long 0x3f284000 + .long 0x3f295000 + .long 0x3f2a5000 + .long 0x3f2b5000 + .long 0x3f2c6000 + .long 0x3f2d6000 + .long 0x3f2e6000 + .long 0x3f2f7000 + .long 0x3f307000 + .long 0x3f317000 + +.align 16 +.L__log_128_tail: + .long 0x00000000 + .long 0x3429ac41 + .long 0x35a8b0fc + .long 0x368d83ea + .long 0x361b0e78 + .long 0x3687b9fe + .long 0x3631ec65 + .long 0x36dd7119 + .long 0x35c30045 + .long 0x379b7751 + .long 0x37ebcb0d + .long 0x37839f83 + .long 0x37528ae5 + .long 0x37a2eb18 + .long 0x36da7495 + .long 0x36a91eb7 + .long 0x3783b715 + .long 0x371131db + .long 0x383f3e68 + .long 0x38156a97 + .long 0x38297c0f + .long 0x387e100f + .long 0x3815b665 + .long 0x37e5e3a1 + .long 0x38183853 + .long 0x35fe719d + .long 0x38448108 + .long 0x38503290 + .long 0x373539e8 + .long 0x385e0ff1 + .long 0x3864a740 + .long 0x3786742d + .long 0x387be3cd + .long 0x3685ad3e + .long 0x3803b715 + .long 0x37adcbdc + .long 0x380c36af + .long 0x371652d3 + .long 0x38927139 + .long 0x38c5fcd7 + .long 0x38ae55d5 + .long 0x3818c169 + .long 0x38a0fde7 + .long 0x38ad09ef + .long 0x3862bae1 + .long 0x38eecd4c + .long 0x3798aad2 + .long 0x37421a1a + .long 0x38c5e10e + .long 0x37bf2aee + .long 0x382d872d + .long 0x37ee2e8a + .long 0x38dedfac + .long 0x3802f2b9 + 
.long 0x38481e9b + .long 0x380eaa2b + .long 0x38ebfb5d + .long 0x38255fdd + .long 0x38783b82 + .long 0x3851da1e + .long 0x374e1b05 + .long 0x388f439b + .long 0x38ca0e10 + .long 0x38cac08b + .long 0x3891f65f + .long 0x378121cb + .long 0x386c9a9a + .long 0x38949923 + .long 0x38777bcc + .long 0x37b12d26 + .long 0x38a6ced3 + .long 0x38ebd3e6 + .long 0x38fbe3cd + .long 0x38d785c2 + .long 0x387e7e00 + .long 0x38f392c5 + .long 0x37d40983 + .long 0x38081a7c + .long 0x3784c3ad + .long 0x38cce923 + .long 0x380f5faf + .long 0x3891fd38 + .long 0x38ac47bc + .long 0x3897042b + .long 0x392952d2 + .long 0x396fced4 + .long 0x37f97073 + .long 0x385e9eae + .long 0x3865c84a + .long 0x38130ba3 + .long 0x3979cf16 + .long 0x3938cac9 + .long 0x38c3d2f4 + .long 0x39755dec + .long 0x38e6b467 + .long 0x395c0fb8 + .long 0x383ebce0 + .long 0x38dcd192 + .long 0x39186bdf + .long 0x392de74c + .long 0x392f0944 + .long 0x391bff61 + .long 0x38e9ed44 + .long 0x38686dc8 + .long 0x396b99a7 + .long 0x39099c89 + .long 0x37a27673 + .long 0x390bdaa3 + .long 0x397069ab + .long 0x388449ff + .long 0x39013538 + .long 0x392dc268 + .long 0x3947f423 + .long 0x394ff17c + .long 0x3945e10e + .long 0x3929e8f5 + .long 0x38f85db0 + .long 0x38735f99 + .long 0x396c08db + .long 0x3909e600 + .long 0x37b4996f + .long 0x391233cc + .long 0x397cead9 + .long 0x38adb5cd + .long 0x3920261a + .long 0x3958ee36 + .long 0x35aa4905 + .long 0x37cbd11e + .long 0x3805fdf4 + +.align 16 +.L__log_F_inv: + .long 0x40000000 + .long 0x3ffe03f8 + .long 0x3ffc0fc1 + .long 0x3ffa232d + .long 0x3ff83e10 + .long 0x3ff6603e + .long 0x3ff4898d + .long 0x3ff2b9d6 + .long 0x3ff0f0f1 + .long 0x3fef2eb7 + .long 0x3fed7304 + .long 0x3febbdb3 + .long 0x3fea0ea1 + .long 0x3fe865ac + .long 0x3fe6c2b4 + .long 0x3fe52598 + .long 0x3fe38e39 + .long 0x3fe1fc78 + .long 0x3fe07038 + .long 0x3fdee95c + .long 0x3fdd67c9 + .long 0x3fdbeb62 + .long 0x3fda740e + .long 0x3fd901b2 + .long 0x3fd79436 + .long 0x3fd62b81 + .long 0x3fd4c77b + .long 0x3fd3680d + .long 0x3fd20d21 + .long 0x3fd0b6a0 + .long 0x3fcf6475 + .long 0x3fce168a + .long 0x3fcccccd + .long 0x3fcb8728 + .long 0x3fca4588 + .long 0x3fc907da + .long 0x3fc7ce0c + .long 0x3fc6980c + .long 0x3fc565c8 + .long 0x3fc43730 + .long 0x3fc30c31 + .long 0x3fc1e4bc + .long 0x3fc0c0c1 + .long 0x3fbfa030 + .long 0x3fbe82fa + .long 0x3fbd6910 + .long 0x3fbc5264 + .long 0x3fbb3ee7 + .long 0x3fba2e8c + .long 0x3fb92144 + .long 0x3fb81703 + .long 0x3fb70fbb + .long 0x3fb60b61 + .long 0x3fb509e7 + .long 0x3fb40b41 + .long 0x3fb30f63 + .long 0x3fb21643 + .long 0x3fb11fd4 + .long 0x3fb02c0b + .long 0x3faf3ade + .long 0x3fae4c41 + .long 0x3fad602b + .long 0x3fac7692 + .long 0x3fab8f6a + .long 0x3faaaaab + .long 0x3fa9c84a + .long 0x3fa8e83f + .long 0x3fa80a81 + .long 0x3fa72f05 + .long 0x3fa655c4 + .long 0x3fa57eb5 + .long 0x3fa4a9cf + .long 0x3fa3d70a + .long 0x3fa3065e + .long 0x3fa237c3 + .long 0x3fa16b31 + .long 0x3fa0a0a1 + .long 0x3f9fd80a + .long 0x3f9f1166 + .long 0x3f9e4cad + .long 0x3f9d89d9 + .long 0x3f9cc8e1 + .long 0x3f9c09c1 + .long 0x3f9b4c70 + .long 0x3f9a90e8 + .long 0x3f99d723 + .long 0x3f991f1a + .long 0x3f9868c8 + .long 0x3f97b426 + .long 0x3f97012e + .long 0x3f964fda + .long 0x3f95a025 + .long 0x3f94f209 + .long 0x3f944581 + .long 0x3f939a86 + .long 0x3f92f114 + .long 0x3f924925 + .long 0x3f91a2b4 + .long 0x3f90fdbc + .long 0x3f905a38 + .long 0x3f8fb824 + .long 0x3f8f177a + .long 0x3f8e7835 + .long 0x3f8dda52 + .long 0x3f8d3dcb + .long 0x3f8ca29c + .long 0x3f8c08c1 + .long 0x3f8b7034 + .long 0x3f8ad8f3 + .long 0x3f8a42f8 + .long 
0x3f89ae41 + .long 0x3f891ac7 + .long 0x3f888889 + .long 0x3f87f781 + .long 0x3f8767ab + .long 0x3f86d905 + .long 0x3f864b8a + .long 0x3f85bf37 + .long 0x3f853408 + .long 0x3f84a9fa + .long 0x3f842108 + .long 0x3f839930 + .long 0x3f83126f + .long 0x3f828cc0 + .long 0x3f820821 + .long 0x3f81848e + .long 0x3f810204 + .long 0x3f808081 + .long 0x3f800000 + +
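Both logf.S and log2f.S share the near-one branch seen above, taken when |x - 1| is below the 2^-4 threshold (.L__real_threshold) and the table lookup would lose accuracy to cancellation. It evaluates ln(x) = 2*atanh(r/(2+r)) as u + u^3/12 + u^5/80 with u = 2r/(2+r); ca_1 and ca_2 are, up to minimax tweaking, those 1/12 and 1/80 coefficients. A minimal C sketch of the logf variant (model code under those assumptions, not the shipped routine):

#include <math.h>

float logf_near_one_model(float x) {
    float r  = x - 1.0f;
    float u1 = r / (2.0f + r);
    float correction = r * u1;            /* equals r - u, so subtracting it */
    float u  = u1 + u1;                   /* restores bits lost forming u    */
    float v  = u * u;
    float r2 = u * v * (0.0833333333f     /* ca_1 ~ 1/12                     */
                        + v * 0.0125f)    /* ca_2 = 1/80                     */
               - correction;
    return r + r2;                        /* ln(x) ~= r + r2                 */
}

In log2f.S the same r + r2 value is kept in two pieces and multiplied by the lead/tail split of log2(e), which is what the .L__mask_lower block in that file is doing.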
diff --git a/src/gas/nearbyint.S b/src/gas/nearbyint.S new file mode 100644 index 0000000..edb1549 --- /dev/null +++ b/src/gas/nearbyint.S
@@ -0,0 +1,98 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# nearbyint.S +# +# An implementation of the nearbyint libm function. +# +# Prototype: +# +# double nearbyint(double x); +# + +# +# Algorithm: +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(nearbyint) +#define fname_special _nearbyint_special + + +# local variable storage offsets + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + movsd .L__2p52_mask_64(%rip),%xmm2 + movsd .L__sign_mask_64(%rip),%xmm4 + movsd %xmm4,%xmm6 + movsd %xmm0,%xmm1 # copy the input into xmm1 and xmm5 + movsd %xmm0,%xmm5 + pand %xmm4,%xmm1 # xmm1 = abs(xmm1) + movsd %xmm1,%xmm3 # copy xmm1 to xmm3 + comisd %xmm2,%xmm1 # compare |x| against 2^52 + jnc .L__greater_than_2p52 # |x| >= 2^52: already an integer + jp .L__is_infinity_nan # the parity flag is set if one of + # xmm1 or xmm2 is NaN +.L__normal_input_case: + #sign.u32 = checkbits.u32[1] & 0x80000000; + #xmm4 = sign.u32 + pandn %xmm5,%xmm4 + #val_2p52.u32[1] = sign.u32 | 0x43300000; + #val_2p52.u32[0] = 0; + por %xmm4,%xmm2 + #val_2p52.f64 = (x + val_2p52.f64) - val_2p52.f64; + addpd %xmm2,%xmm5 + subpd %xmm5,%xmm2 + #val_2p52.u32[1] = ((val_2p52.u32[1] << 1) >> 1) | sign.u32; + pand %xmm6,%xmm2 + por %xmm4,%xmm2 + movsd %xmm2,%xmm0 # move the result to xmm0 register + ret +.L__special_case: +.L__greater_than_2p52: + ret # result is present in xmm0 +.L__is_infinity_nan: + addpd %xmm0,%xmm0 + ret +.align 16 +.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF + .quad 0 +.L__2p52_mask_64: .quad 0x4330000000000000 + .quad 0 +.L__exp_mask_64: .quad 0x7FF0000000000000 + .quad 0 + + + + + 
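nearbyint above rounds by the classic 2^52 trick: for |x| < 2^52, adding and then subtracting a 2^52 that carries x's sign forces the FPU to round x to an integer in the current rounding mode, and the final pand/por pair re-attaches the sign so that, for example, -0.25 becomes -0.0. A C model of the normal-input path (assumes the NaN/Inf and |x| >= 2^52 cases were already branched out, and compilation without -ffast-math so the add/subtract is not folded away):

#include <stdint.h>
#include <string.h>

double nearbyint_model(double x) {
    uint64_t bits; memcpy(&bits, &x, sizeof bits);
    uint64_t sign = bits & 0x8000000000000000ULL;
    uint64_t t = 0x4330000000000000ULL | sign;      /* copysign(2^52, x)      */
    double two52; memcpy(&two52, &t, sizeof two52);
    double r = (x + two52) - two52;                 /* rounds in current mode */
    memcpy(&t, &r, sizeof t);
    t = (t & 0x7fffffffffffffffULL) | sign;         /* keep the input's sign  */
    memcpy(&r, &t, sizeof r);
    return r;
}

In the default round-to-nearest-even mode, nearbyint_model(2.5) returns 2.0 and nearbyint_model(-0.25) returns -0.0.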
diff --git a/src/gas/pow.S b/src/gas/pow.S new file mode 100644 index 0000000..8028b83 --- /dev/null +++ b/src/gas/pow.S
@@ -0,0 +1,2244 @@ +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +#ifdef __x86_64__ +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# pow.S +# +# An implementation of the pow libm function. +# +# Prototype: +# +# double pow(double x, double y); +# + +# +# Algorithm: +# x^y = e^(y*ln(x)) +# +# Look in exp, log for the respective algorithms +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(pow) +#define fname_special _pow_special@PLT + + +# local variable storage offsets +.equ save_x, 0x0 +.equ save_y, 0x10 +.equ p_temp_exp, 0x20 +.equ negate_result, 0x30 +.equ save_ax, 0x40 +.equ y_head, 0x50 +.equ p_temp_log, 0x60 +.equ stack_size, 0x78 + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + sub $stack_size, %rsp + + movsd %xmm0, save_x(%rsp) + movsd %xmm1, save_y(%rsp) + + mov save_x(%rsp), %rdx + mov save_y(%rsp), %r8 + + mov .L__exp_mant_mask(%rip), %r10 + and %r8, %r10 + jz .L__y_is_zero + + cmp .L__pos_one(%rip), %r8 + je .L__y_is_one + + mov .L__sign_mask(%rip), %r9 + and %rdx, %r9 + cmp .L__sign_mask(%rip), %r9 + mov .L__pos_zero(%rip), %rax + mov %rax, negate_result(%rsp) + je .L__x_is_neg + + cmp .L__pos_one(%rip), %rdx + je .L__x_is_pos_one + + cmp .L__pos_zero(%rip), %rdx + je .L__x_is_zero + + mov .L__exp_mask(%rip), %r9 + and %rdx, %r9 + cmp .L__exp_mask(%rip), %r9 + je .L__x_is_inf_or_nan + + mov .L__exp_mask(%rip), %r10 + and %r8, %r10 + cmp .L__ay_max_bound(%rip), %r10 + jg .L__ay_is_very_large + + mov .L__exp_mask(%rip), %r10 + and %r8, %r10 + cmp .L__ay_min_bound(%rip), %r10 + jl .L__ay_is_very_small + + # ----------------------------- + # compute log(x) here + # ----------------------------- +.L__log_x: + + # compute exponent part + xor %r8, %r8 + movdqa %xmm0, %xmm3 + psrlq $52, %xmm3 + movd %xmm0, %r8 + psubq .L__mask_1023(%rip), %xmm3 + movdqa %xmm0, %xmm2 + cvtdq2pd %xmm3, %xmm6 # xexp + pand .L__real_mant(%rip), %xmm2 + + comisd .L__mask_1023_f(%rip), %xmm6 + je .L__denormal_adjust + +.L__continue_common: + + # compute index into the log tables + movsd %xmm0, %xmm7 + mov %r8, %r9 + and .L__mask_mant_all8(%rip), %r8 + and .L__mask_mant9(%rip), %r9 + subsd .L__real_one(%rip), %xmm7 + shl %r9 + add %r9, %r8 + mov %r8, p_temp_log(%rsp) + andpd .L__real_notsign(%rip), %xmm7 + + # F, Y, switch to near-one codepath + movsd p_temp_log(%rsp), %xmm1 + shr $44, %r8 + por .L__real_half(%rip), %xmm2 + por .L__real_half(%rip), %xmm1 + comisd .L__real_threshold(%rip), %xmm7 + lea .L__log_F_inv_head(%rip), %r9 + lea .L__log_F_inv_tail(%rip), %rdx + jb .L__near_one + + # f = F - Y, r = f * inv + subsd %xmm2, %xmm1 + movsd %xmm1, %xmm4 + mulsd (%r9,%r8,8), %xmm1 + movsd %xmm1, %xmm5 + mulsd (%rdx,%r8,8), %xmm4 + movsd %xmm4, %xmm7 + addsd %xmm4, %xmm1 + + movsd %xmm1, %xmm2 + movsd %xmm1, %xmm0 + lea 
.L__log_256_lead(%rip), %r9 + + # poly + movsd .L__real_1_over_6(%rip), %xmm3 + movsd .L__real_1_over_3(%rip), %xmm1 + mulsd %xmm2, %xmm3 + mulsd %xmm2, %xmm1 + mulsd %xmm2, %xmm0 + subsd %xmm2, %xmm5 + movsd %xmm0, %xmm4 + addsd .L__real_1_over_5(%rip), %xmm3 + addsd .L__real_1_over_2(%rip), %xmm1 + mulsd %xmm0, %xmm4 + mulsd %xmm2, %xmm3 + mulsd %xmm0, %xmm1 + addsd .L__real_1_over_4(%rip), %xmm3 + addsd %xmm5, %xmm7 + mulsd %xmm4, %xmm3 + addsd %xmm3, %xmm1 + addsd %xmm7, %xmm1 + + movsd .L__real_log2_tail(%rip), %xmm5 + lea .L__log_256_tail(%rip), %rdx + mulsd %xmm6, %xmm5 + movsd (%r9,%r8,8), %xmm0 + subsd %xmm1, %xmm5 + + movsd (%rdx,%r8,8), %xmm3 + addsd %xmm5, %xmm3 + movsd %xmm3, %xmm1 + subsd %xmm2, %xmm3 + + movsd .L__real_log2_lead(%rip), %xmm7 + mulsd %xmm6, %xmm7 + addsd %xmm7, %xmm0 + + # result of ln(x) is computed from head and tail parts, resH and resT + # res = ln(x) = resH + resT + # resH and resT are in full precision + + # resT is computed from head and tail parts, resT_h and resT_t + # resT = resT_h + resT_t + + # now + # xmm3 - resT + # xmm0 - resH + # xmm1 - (resT_t) + # xmm2 - (-resT_h) + +.L__log_x_continue: + + movsd %xmm0, %xmm7 + addsd %xmm3, %xmm0 + movsd %xmm0, %xmm5 + andpd .L__real_fffffffff8000000(%rip), %xmm0 + + # xmm0 - H + # xmm7 - resH + # xmm5 - res + + mov save_y(%rsp), %rax + and .L__real_fffffffff8000000(%rip), %rax + + addsd %xmm3, %xmm2 + subsd %xmm5, %xmm7 + subsd %xmm2, %xmm1 + addsd %xmm3, %xmm7 + subsd %xmm0, %xmm5 + + mov %rax, y_head(%rsp) + movsd save_y(%rsp), %xmm4 + + addsd %xmm1, %xmm7 + addsd %xmm5, %xmm7 + + # res = H + T + # H has leading 26 bits of precision + # T has full precision + + # xmm0 - H + # xmm7 - T + + movsd y_head(%rsp), %xmm2 + subsd %xmm2, %xmm4 + + # y is split into head and tail + # for y * ln(x) computation + + # xmm4 - Yt + # xmm2 - Yh + # xmm0 - H + # xmm7 - T + + movsd %xmm4, %xmm3 + movsd %xmm7, %xmm5 + movsd %xmm0, %xmm6 + mulsd %xmm7, %xmm3 # YtRt + mulsd %xmm0, %xmm4 # YtRh + mulsd %xmm2, %xmm5 # YhRt + mulsd %xmm2, %xmm6 # YhRh + + movsd %xmm6, %xmm1 + addsd %xmm4, %xmm3 + addsd %xmm5, %xmm3 + + addsd %xmm3, %xmm1 + movsd %xmm1, %xmm0 + + subsd %xmm1, %xmm6 + addsd %xmm3, %xmm6 + + # y * ln(x) = v + vt + # v and vt are in full precision + + # xmm0 - v + # xmm6 - vt + + # ----------------------------- + # compute exp( y * ln(x) ) here + # ----------------------------- + + # v * (64/ln(2)) + movsd .L__real_64_by_log2(%rip), %xmm7 + movsd %xmm0, p_temp_exp(%rsp) + mulsd %xmm0, %xmm7 + mov p_temp_exp(%rsp), %rdx + + # v < 1024*ln(2), ( v * (64/ln(2)) ) < 64*1024 + # v >= -1075*ln(2), ( v * (64/ln(2)) ) >= 64*(-1075) + comisd .L__real_p65536(%rip), %xmm7 + ja .L__process_result_inf + + comisd .L__real_m68800(%rip), %xmm7 + jb .L__process_result_zero + + # n = int( v * (64/ln(2)) ) + cvtpd2dq %xmm7, %xmm4 + lea .L__two_to_jby64_head_table(%rip), %r10 + lea .L__two_to_jby64_tail_table(%rip), %r11 + cvtdq2pd %xmm4, %xmm1 + + # r1 = x - n * ln(2)/64 head + movsd .L__real_log2_by_64_head(%rip), %xmm2 + mulsd %xmm1, %xmm2 + movd %xmm4, %ecx + mov $0x3f, %rax + and %ecx, %eax + subsd %xmm2, %xmm0 + + # r2 = - n * ln(2)/64 tail + mulsd .L__real_log2_by_64_tail(%rip), %xmm1 + movsd %xmm0, %xmm2 + + # m = (n - j) / 64 + sub %eax, %ecx + sar $6, %ecx + + # r1+r2 + addsd %xmm1, %xmm2 + addsd %xmm6, %xmm2 # add vt here + movsd %xmm2, %xmm1 + + # q + movsd .L__real_1_by_2(%rip), %xmm0 + movsd .L__real_1_by_24(%rip), %xmm3 + movsd .L__real_1_by_720(%rip), %xmm4 + mulsd %xmm2, %xmm1 + mulsd %xmm2, %xmm0 + mulsd %xmm2, %xmm3 + 
mulsd %xmm2, %xmm4 + + movsd %xmm1, %xmm5 + mulsd %xmm2, %xmm1 + addsd .L__real_one(%rip), %xmm0 + addsd .L__real_1_by_6(%rip), %xmm3 + mulsd %xmm1, %xmm5 + addsd .L__real_1_by_120(%rip), %xmm4 + mulsd %xmm2, %xmm0 + mulsd %xmm1, %xmm3 + + mulsd %xmm5, %xmm4 + + # deal with denormal results + xor %r9d, %r9d + cmp .L__denormal_threshold(%rip), %ecx + + addsd %xmm4, %xmm3 + addsd %xmm3, %xmm0 + + cmovle %ecx, %r9d + add $1023, %rcx + shl $52, %rcx + + # f1, f2 + movsd (%r11,%rax,8), %xmm5 + movsd (%r10,%rax,8), %xmm1 + mulsd %xmm0, %xmm5 + mulsd %xmm0, %xmm1 + + cmp .L__real_inf(%rip), %rcx + + # (f1+f2)*(1+q) + addsd (%r11,%rax,8), %xmm5 + addsd %xmm5, %xmm1 + addsd (%r10,%rax,8), %xmm1 + movsd %xmm1, %xmm0 + + je .L__process_almost_inf + + test %r9d, %r9d + mov %rcx, p_temp_exp(%rsp) + jnz .L__process_denormal + mulsd p_temp_exp(%rsp), %xmm0 + orpd negate_result(%rsp), %xmm0 + +.L__final_check: + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__process_almost_inf: + comisd .L__real_one(%rip), %xmm0 + jae .L__process_result_inf + + orpd .L__enable_almost_inf(%rip), %xmm0 + orpd negate_result(%rsp), %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__process_denormal: + mov %r9d, %ecx + xor %r11d, %r11d + comisd .L__real_one(%rip), %xmm0 + cmovae %ecx, %r11d + cmp .L__denormal_threshold(%rip), %r11d + jne .L__process_true_denormal + + mulsd p_temp_exp(%rsp), %xmm0 + orpd negate_result(%rsp), %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__process_true_denormal: + xor %r8, %r8 + cmp .L__denormal_tiny_threshold(%rip), %rdx + mov $1, %r9 + jg .L__process_denormal_tiny + add $1074, %ecx + cmovs %r8, %rcx + shl %cl, %r9 + mov %r9, %rcx + + mov %rcx, p_temp_exp(%rsp) + mulsd p_temp_exp(%rsp), %xmm0 + orpd negate_result(%rsp), %xmm0 + jmp .L__z_denormal + +.p2align 4,,15 +.L__process_denormal_tiny: + movsd .L__real_smallest_denormal(%rip), %xmm0 + orpd negate_result(%rsp), %xmm0 + jmp .L__z_denormal + +.p2align 4,,15 +.L__process_result_zero: + mov .L__real_zero(%rip), %r11 + or negate_result(%rsp), %r11 + jmp .L__z_is_zero_or_inf + +.p2align 4,,15 +.L__process_result_inf: + mov .L__real_inf(%rip), %r11 + or negate_result(%rsp), %r11 + jmp .L__z_is_zero_or_inf + +.p2align 4,,15 +.L__denormal_adjust: + por .L__real_one(%rip), %xmm2 + subsd .L__real_one(%rip), %xmm2 + movsd %xmm2, %xmm5 + pand .L__real_mant(%rip), %xmm2 + movd %xmm2, %r8 + psrlq $52, %xmm5 + psubd .L__mask_2045(%rip), %xmm5 + cvtdq2pd %xmm5, %xmm6 + jmp .L__continue_common + +.p2align 4,,15 +.L__x_is_neg: + + mov .L__exp_mask(%rip), %r10 + and %r8, %r10 + cmp .L__ay_max_bound(%rip), %r10 + jg .L__ay_is_very_large + + # determine if y is an integer + mov .L__exp_mant_mask(%rip), %r10 + and %r8, %r10 + mov %r10, %r11 + mov .L__exp_shift(%rip), %rcx + shr %cl, %r10 + sub .L__exp_bias(%rip), %r10 + js .L__x_is_neg_y_is_not_int + + mov .L__exp_mant_mask(%rip), %rax + and %rdx, %rax + mov %rax, save_ax(%rsp) + + cmp .L__yexp_53(%rip), %r10 + mov %r10, %rcx + jg .L__continue_after_y_int_check + + mov .L__mant_full(%rip), %r9 + shr %cl, %r9 + and %r11, %r9 + jnz .L__x_is_neg_y_is_not_int + + mov .L__1_before_mant(%rip), %r9 + shr %cl, %r9 + and %r11, %r9 + jz .L__continue_after_y_int_check + + mov .L__sign_mask(%rip), %rax + mov %rax, negate_result(%rsp) + +.L__continue_after_y_int_check: + + cmp .L__neg_zero(%rip), %rdx + je .L__x_is_zero + + cmp .L__neg_one(%rip), %rdx + je .L__x_is_neg_one + + mov .L__exp_mask(%rip), %r9 + and %rdx, %r9 + cmp .L__exp_mask(%rip), %r9 + je .L__x_is_inf_or_nan + + movsd save_ax(%rsp), %xmm0 + jmp 
.L__log_x + + +.p2align 4,,15 +.L__near_one: + + # f = F - Y, r = f * inv + movsd %xmm1, %xmm0 + subsd %xmm2, %xmm1 + movsd %xmm1, %xmm4 + + movsd (%r9,%r8,8), %xmm3 + addsd (%rdx,%r8,8), %xmm3 + mulsd %xmm3, %xmm4 + andpd .L__real_fffffffff8000000(%rip), %xmm4 + movsd %xmm4, %xmm5 # r1 + mulsd %xmm0, %xmm4 + subsd %xmm4, %xmm1 + mulsd %xmm3, %xmm1 + movsd %xmm1, %xmm7 # r2 + addsd %xmm5, %xmm1 + + movsd %xmm1, %xmm2 + movsd %xmm1, %xmm0 + + lea .L__log_256_lead(%rip), %r9 + + # poly + movsd .L__real_1_over_7(%rip), %xmm3 + movsd .L__real_1_over_4(%rip), %xmm1 + mulsd %xmm2, %xmm3 + mulsd %xmm2, %xmm1 + mulsd %xmm2, %xmm0 + movsd %xmm0, %xmm4 + addsd .L__real_1_over_6(%rip), %xmm3 + addsd .L__real_1_over_3(%rip), %xmm1 + mulsd %xmm0, %xmm4 + mulsd %xmm2, %xmm3 + mulsd %xmm2, %xmm1 + addsd .L__real_1_over_5(%rip), %xmm3 + mulsd %xmm2, %xmm3 + mulsd %xmm0, %xmm1 + mulsd %xmm4, %xmm3 + + movsd %xmm5, %xmm2 + movsd %xmm7, %xmm0 + mulsd %xmm0, %xmm0 + mulsd .L__real_1_over_2(%rip), %xmm0 + mulsd %xmm7, %xmm5 + addsd %xmm0, %xmm5 + addsd %xmm7, %xmm5 + + movsd %xmm2, %xmm0 + movsd %xmm2, %xmm7 + mulsd %xmm0, %xmm0 + mulsd .L__real_1_over_2(%rip), %xmm0 + movsd %xmm0, %xmm4 + addsd %xmm0, %xmm2 # r1 + r1^2/2 + subsd %xmm2, %xmm7 + addsd %xmm4, %xmm7 + + addsd %xmm7, %xmm3 + movsd .L__real_log2_tail(%rip), %xmm4 + addsd %xmm3, %xmm1 + mulsd %xmm6, %xmm4 + lea .L__log_256_tail(%rip), %rdx + addsd %xmm5, %xmm1 + addsd (%rdx,%r8,8), %xmm4 + subsd %xmm1, %xmm4 + + movsd %xmm4, %xmm3 + movsd %xmm4, %xmm1 + subsd %xmm2, %xmm3 + + movsd (%r9,%r8,8), %xmm0 + movsd .L__real_log2_lead(%rip), %xmm7 + mulsd %xmm6, %xmm7 + addsd %xmm7, %xmm0 + + jmp .L__log_x_continue + + +.p2align 4,,15 +.L__x_is_pos_one: + xor %rax, %rax + mov .L__exp_mask(%rip), %r10 + and %r8, %r10 + cmp .L__exp_mask(%rip), %r10 + cmove %r8, %rax + mov .L__mant_mask(%rip), %r10 + and %rax, %r10 + jz .L__final_check + + mov .L__qnan_set(%rip), %r10 + and %r8, %r10 + jnz .L__final_check + + movsd save_x(%rsp), %xmm0 + movsd save_y(%rsp), %xmm1 + movsd .L__pos_one(%rip), %xmm2 + mov .L__flag_x_one_y_snan(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__y_is_zero: + xor %rax, %rax + mov .L__exp_mask(%rip), %r9 + mov .L__real_one(%rip), %r11 + and %rdx, %r9 + cmp .L__exp_mask(%rip), %r9 + cmove %rdx, %rax + mov .L__mant_mask(%rip), %r9 + and %rax, %r9 + jnz .L__x_is_nan + + movsd .L__real_one(%rip), %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__y_is_one: + xor %rax, %rax + mov %rdx, %r11 + mov .L__exp_mask(%rip), %r9 + or .L__qnan_set(%rip), %r11 + and %rdx, %r9 + cmp .L__exp_mask(%rip), %r9 + cmove %rdx, %rax + mov .L__mant_mask(%rip), %r9 + and %rax, %r9 + jnz .L__x_is_nan + + movd %rdx, %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_neg_one: + mov .L__pos_one(%rip), %rdx + or negate_result(%rsp), %rdx + xor %rax, %rax + mov %r8, %r11 + mov .L__exp_mask(%rip), %r10 + or .L__qnan_set(%rip), %r11 + and %r8, %r10 + cmp .L__exp_mask(%rip), %r10 + cmove %r8, %rax + mov .L__mant_mask(%rip), %r10 + and %rax, %r10 + jnz .L__y_is_nan + + movd %rdx, %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_neg_y_is_not_int: + mov .L__exp_mask(%rip), %r9 + and %rdx, %r9 + cmp .L__exp_mask(%rip), %r9 + je .L__x_is_inf_or_nan + + cmp .L__neg_zero(%rip), %rdx + je .L__x_is_zero + + movsd save_x(%rsp), %xmm0 + movsd save_y(%rsp), %xmm1 + movsd .L__qnan(%rip), %xmm2 + mov .L__flag_x_neg_y_notint(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__ay_is_very_large: + mov 
.L__exp_mask(%rip), %r9 + and %rdx, %r9 + cmp .L__exp_mask(%rip), %r9 + je .L__x_is_inf_or_nan + + mov .L__exp_mant_mask(%rip), %r9 + and %rdx, %r9 + jz .L__x_is_zero + + cmp .L__neg_one(%rip), %rdx + je .L__x_is_neg_one + + mov %rdx, %r9 + and .L__exp_mant_mask(%rip), %r9 + cmp .L__pos_one(%rip), %r9 + jl .L__ax_lt1_y_is_large_or_inf_or_nan + + jmp .L__ax_gt1_y_is_large_or_inf_or_nan + +.p2align 4,,15 +.L__x_is_zero: + mov .L__exp_mask(%rip), %r10 + xor %rax, %rax + and %r8, %r10 + cmp .L__exp_mask(%rip), %r10 + je .L__x_is_zero_y_is_inf_or_nan + + mov .L__sign_mask(%rip), %r10 + and %r8, %r10 + cmovnz .L__pos_inf(%rip), %rax + jnz .L__x_is_zero_z_is_inf + + movd %rax, %xmm0 + orpd negate_result(%rsp), %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_zero_z_is_inf: + + movsd save_x(%rsp), %xmm0 + movsd save_y(%rsp), %xmm1 + movd %rax, %xmm2 + orpd negate_result(%rsp), %xmm2 + mov .L__flag_x_zero_z_inf(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_zero_y_is_inf_or_nan: + mov %r8, %r11 + cmp .L__neg_inf(%rip), %r8 + cmove .L__pos_inf(%rip), %rax + je .L__x_is_zero_z_is_inf + + or .L__qnan_set(%rip), %r11 + mov .L__mant_mask(%rip), %r10 + and %r8, %r10 + jnz .L__y_is_nan + + movd %rax, %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_inf_or_nan: + xor %r11, %r11 + mov .L__sign_mask(%rip), %r10 + and %r8, %r10 + cmovz .L__pos_inf(%rip), %r11 + mov %rdx, %rax + mov .L__mant_mask(%rip), %r9 + or .L__qnan_set(%rip), %rax + and %rdx, %r9 + cmovnz %rax, %r11 + jnz .L__x_is_nan + + xor %rax, %rax + mov %r8, %r9 + mov .L__exp_mask(%rip), %r10 + or .L__qnan_set(%rip), %r9 + and %r8, %r10 + cmp .L__exp_mask(%rip), %r10 + cmove %r8, %rax + mov .L__mant_mask(%rip), %r10 + and %rax, %r10 + cmovnz %r9, %r11 + jnz .L__y_is_nan + + movd %r11, %xmm0 + orpd negate_result(%rsp), %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__ay_is_very_small: + movsd .L__pos_one(%rip), %xmm0 + addsd %xmm1, %xmm0 + jmp .L__final_check + + +.p2align 4,,15 +.L__ax_lt1_y_is_large_or_inf_or_nan: + xor %r11, %r11 + mov .L__sign_mask(%rip), %r10 + and %r8, %r10 + cmovnz .L__pos_inf(%rip), %r11 + jmp .L__adjust_for_nan + +.p2align 4,,15 +.L__ax_gt1_y_is_large_or_inf_or_nan: + xor %r11, %r11 + mov .L__sign_mask(%rip), %r10 + and %r8, %r10 + cmovz .L__pos_inf(%rip), %r11 + +.p2align 4,,15 +.L__adjust_for_nan: + + xor %rax, %rax + mov %r8, %r9 + mov .L__exp_mask(%rip), %r10 + or .L__qnan_set(%rip), %r9 + and %r8, %r10 + cmp .L__exp_mask(%rip), %r10 + cmove %r8, %rax + mov .L__mant_mask(%rip), %r10 + and %rax, %r10 + cmovnz %r9, %r11 + jnz .L__y_is_nan + + test %rax, %rax + jnz .L__y_is_inf + +.p2align 4,,15 +.L__z_is_zero_or_inf: + + mov .L__flag_z_zero(%rip), %edi + test %r11, %r11 + cmovnz .L__flag_z_inf(%rip), %edi + + movsd save_x(%rsp), %xmm0 + movsd save_y(%rsp), %xmm1 + movd %r11, %xmm2 + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__y_is_inf: + + movd %r11, %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_nan: + + xor %rax, %rax + mov .L__exp_mask(%rip), %r10 + and %r8, %r10 + cmp .L__exp_mask(%rip), %r10 + cmove %r8, %rax + mov .L__mant_mask(%rip), %r10 + and %rax, %r10 + jnz .L__x_is_nan_y_is_nan + + mov .L__qnan_set(%rip), %r9 + and %rdx, %r9 + movd %r11, %xmm0 + jnz .L__final_check + + movsd save_x(%rsp), %xmm0 + movsd save_y(%rsp), %xmm1 + movd %r11, %xmm2 + mov .L__flag_x_nan(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__y_is_nan: + + mov .L__qnan_set(%rip), %r10 + and %r8, %r10 + movd %r11, %xmm0 + jnz 
.L__final_check + + movsd save_x(%rsp), %xmm0 + movsd save_y(%rsp), %xmm1 + movd %r11, %xmm2 + mov .L__flag_y_nan(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_nan_y_is_nan: + + mov .L__qnan_set(%rip), %r9 + and %rdx, %r9 + jz .L__continue_xy_nan + + mov .L__qnan_set(%rip), %r10 + and %r8, %r10 + jz .L__continue_xy_nan + + movd %r11, %xmm0 + jmp .L__final_check + +.L__continue_xy_nan: + movsd save_x(%rsp), %xmm0 + movsd save_y(%rsp), %xmm1 + movd %r11, %xmm2 + mov .L__flag_x_nan_y_nan(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__z_denormal: + + movsd %xmm0, %xmm2 + movsd save_x(%rsp), %xmm0 + movsd save_y(%rsp), %xmm1 + mov .L__flag_z_denormal(%rip), %edi + + call fname_special + jmp .L__final_check + + +.data +.align 16 + +# these codes and the ones in the corresponding .c file have to match +.L__flag_x_one_y_snan: .long 1 +.L__flag_x_zero_z_inf: .long 2 +.L__flag_x_nan: .long 3 +.L__flag_y_nan: .long 4 +.L__flag_x_nan_y_nan: .long 5 +.L__flag_x_neg_y_notint: .long 6 +.L__flag_z_zero: .long 7 +.L__flag_z_denormal: .long 8 +.L__flag_z_inf: .long 9 + +.align 16 + +.L__ay_max_bound: .quad 0x43e0000000000000 +.L__ay_min_bound: .quad 0x3c00000000000000 +.L__sign_mask: .quad 0x8000000000000000 +.L__sign_and_exp_mask: .quad 0x0fff0000000000000 +.L__exp_mask: .quad 0x7ff0000000000000 +.L__neg_inf: .quad 0x0fff0000000000000 +.L__pos_inf: .quad 0x7ff0000000000000 +.L__pos_one: .quad 0x3ff0000000000000 +.L__pos_zero: .quad 0x0000000000000000 +.L__exp_mant_mask: .quad 0x7fffffffffffffff +.L__mant_mask: .quad 0x000fffffffffffff +.L__ind_pattern: .quad 0x0fff8000000000000 + +.L__neg_qnan: .quad 0x0fff8000000000000 +.L__qnan: .quad 0x7ff8000000000000 +.L__qnan_set: .quad 0x0008000000000000 + +.L__neg_one: .quad 0x0bff0000000000000 +.L__neg_zero: .quad 0x8000000000000000 + +.L__exp_shift: .quad 0x0000000000000034 # 52 +.L__exp_bias: .quad 0x00000000000003ff # 1023 +.L__exp_bias_m1: .quad 0x00000000000003fe # 1022 + +.L__yexp_53: .quad 0x0000000000000035 # 53 +.L__mant_full: .quad 0x000fffffffffffff +.L__1_before_mant: .quad 0x0010000000000000 + +.L__mask_mant_all8: .quad 0x000ff00000000000 +.L__mask_mant9: .quad 0x0000080000000000 + +.align 16 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 + .quad 0x0fffffffff8000000 + +.L__mask_8000000000000000: .quad 0x8000000000000000 + .quad 0x8000000000000000 + +.L__real_4090040000000000: .quad 0x4090040000000000 + .quad 0x4090040000000000 + +.L__real_C090C80000000000: .quad 0x0C090C80000000000 + .quad 0x0C090C80000000000 + +#--------------------- +# log data +#--------------------- + +.align 16 + +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0000000000000000 +.L__real_inf: .quad 0x7ff0000000000000 # +inf + .quad 0x0000000000000000 +.L__real_nan: .quad 0x7ff8000000000000 # NaN + .quad 0x0000000000000000 +.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000000000000000 +.L__mask_1023: .quad 0x00000000000003ff + .quad 0x0000000000000000 +.L__mask_001: .quad 0x0000000000000001 + .quad 0x0000000000000000 + + +.L__real_log2_lead: .quad 0x3fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x0000000000000000 +.L__real_log2_tail: .quad 0x3e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x0000000000000000 + +.L__real_two: .quad 0x4000000000000000 # 2 + .quad 0x0000000000000000 + +.L__real_one: .quad 0x3ff0000000000000 # 1 + .quad 0x0000000000000000 + +.L__real_half: .quad 0x3fe0000000000000 # 1/2 + .quad 0x0000000000000000 + 
+.L__mask_100: .quad 0x0000000000000100 + .quad 0x0000000000000000 + +.L__real_1_over_2: .quad 0x3fe0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_3: .quad 0x3fd5555555555555 + .quad 0x0000000000000000 +.L__real_1_over_4: .quad 0x3fd0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_5: .quad 0x3fc999999999999a + .quad 0x0000000000000000 +.L__real_1_over_6: .quad 0x3fc5555555555555 + .quad 0x0000000000000000 +.L__real_1_over_7: .quad 0x3fc2492492492494 + .quad 0x0000000000000000 + +.L__mask_1023_f: .quad 0x0c08ff80000000000 + .quad 0x0000000000000000 + +.L__mask_2045: .quad 0x00000000000007fd + .quad 0x0000000000000000 + +.L__real_threshold: .quad 0x3fc0000000000000 # 0.125 + .quad 0x3fc0000000000000 + +.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit + .quad 0x0000000000000000 + + +.align 16 +.L__log_256_lead: + .quad 0x0000000000000000 + .quad 0x3f6ff00aa0000000 + .quad 0x3f7fe02a60000000 + .quad 0x3f87dc4750000000 + .quad 0x3f8fc0a8b0000000 + .quad 0x3f93cea440000000 + .quad 0x3f97b91b00000000 + .quad 0x3f9b9fc020000000 + .quad 0x3f9f829b00000000 + .quad 0x3fa1b0d980000000 + .quad 0x3fa39e87b0000000 + .quad 0x3fa58a5ba0000000 + .quad 0x3fa77458f0000000 + .quad 0x3fa95c8300000000 + .quad 0x3fab42dd70000000 + .quad 0x3fad276b80000000 + .quad 0x3faf0a30c0000000 + .quad 0x3fb0759830000000 + .quad 0x3fb16536e0000000 + .quad 0x3fb253f620000000 + .quad 0x3fb341d790000000 + .quad 0x3fb42edcb0000000 + .quad 0x3fb51b0730000000 + .quad 0x3fb60658a0000000 + .quad 0x3fb6f0d280000000 + .quad 0x3fb7da7660000000 + .quad 0x3fb8c345d0000000 + .quad 0x3fb9ab4240000000 + .quad 0x3fba926d30000000 + .quad 0x3fbb78c820000000 + .quad 0x3fbc5e5480000000 + .quad 0x3fbd4313d0000000 + .quad 0x3fbe270760000000 + .quad 0x3fbf0a30c0000000 + .quad 0x3fbfec9130000000 + .quad 0x3fc0671510000000 + .quad 0x3fc0d77e70000000 + .quad 0x3fc1478580000000 + .quad 0x3fc1b72ad0000000 + .quad 0x3fc2266f10000000 + .quad 0x3fc29552f0000000 + .quad 0x3fc303d710000000 + .quad 0x3fc371fc20000000 + .quad 0x3fc3dfc2b0000000 + .quad 0x3fc44d2b60000000 + .quad 0x3fc4ba36f0000000 + .quad 0x3fc526e5e0000000 + .quad 0x3fc59338d0000000 + .quad 0x3fc5ff3070000000 + .quad 0x3fc66acd40000000 + .quad 0x3fc6d60fe0000000 + .quad 0x3fc740f8f0000000 + .quad 0x3fc7ab8900000000 + .quad 0x3fc815c0a0000000 + .quad 0x3fc87fa060000000 + .quad 0x3fc8e928d0000000 + .quad 0x3fc9525a90000000 + .quad 0x3fc9bb3620000000 + .quad 0x3fca23bc10000000 + .quad 0x3fca8becf0000000 + .quad 0x3fcaf3c940000000 + .quad 0x3fcb5b5190000000 + .quad 0x3fcbc28670000000 + .quad 0x3fcc296850000000 + .quad 0x3fcc8ff7c0000000 + .quad 0x3fccf63540000000 + .quad 0x3fcd5c2160000000 + .quad 0x3fcdc1bca0000000 + .quad 0x3fce270760000000 + .quad 0x3fce8c0250000000 + .quad 0x3fcef0adc0000000 + .quad 0x3fcf550a50000000 + .quad 0x3fcfb91860000000 + .quad 0x3fd00e6c40000000 + .quad 0x3fd0402590000000 + .quad 0x3fd071b850000000 + .quad 0x3fd0a324e0000000 + .quad 0x3fd0d46b50000000 + .quad 0x3fd1058bf0000000 + .quad 0x3fd1368700000000 + .quad 0x3fd1675ca0000000 + .quad 0x3fd1980d20000000 + .quad 0x3fd1c898c0000000 + .quad 0x3fd1f8ff90000000 + .quad 0x3fd22941f0000000 + .quad 0x3fd2596010000000 + .quad 0x3fd2895a10000000 + .quad 0x3fd2b93030000000 + .quad 0x3fd2e8e2b0000000 + .quad 0x3fd31871c0000000 + .quad 0x3fd347dd90000000 + .quad 0x3fd3772660000000 + .quad 0x3fd3a64c50000000 + .quad 0x3fd3d54fa0000000 + .quad 0x3fd4043080000000 + .quad 0x3fd432ef20000000 + .quad 0x3fd4618bc0000000 + .quad 0x3fd4900680000000 + .quad 0x3fd4be5f90000000 + .quad 0x3fd4ec9730000000 + .quad 
0x3fd51aad80000000 + .quad 0x3fd548a2c0000000 + .quad 0x3fd5767710000000 + .quad 0x3fd5a42ab0000000 + .quad 0x3fd5d1bdb0000000 + .quad 0x3fd5ff3070000000 + .quad 0x3fd62c82f0000000 + .quad 0x3fd659b570000000 + .quad 0x3fd686c810000000 + .quad 0x3fd6b3bb20000000 + .quad 0x3fd6e08ea0000000 + .quad 0x3fd70d42e0000000 + .quad 0x3fd739d7f0000000 + .quad 0x3fd7664e10000000 + .quad 0x3fd792a550000000 + .quad 0x3fd7bede00000000 + .quad 0x3fd7eaf830000000 + .quad 0x3fd816f410000000 + .quad 0x3fd842d1d0000000 + .quad 0x3fd86e9190000000 + .quad 0x3fd89a3380000000 + .quad 0x3fd8c5b7c0000000 + .quad 0x3fd8f11e80000000 + .quad 0x3fd91c67e0000000 + .quad 0x3fd9479410000000 + .quad 0x3fd972a340000000 + .quad 0x3fd99d9580000000 + .quad 0x3fd9c86b00000000 + .quad 0x3fd9f323e0000000 + .quad 0x3fda1dc060000000 + .quad 0x3fda484090000000 + .quad 0x3fda72a490000000 + .quad 0x3fda9cec90000000 + .quad 0x3fdac718c0000000 + .quad 0x3fdaf12930000000 + .quad 0x3fdb1b1e00000000 + .quad 0x3fdb44f770000000 + .quad 0x3fdb6eb590000000 + .quad 0x3fdb985890000000 + .quad 0x3fdbc1e080000000 + .quad 0x3fdbeb4d90000000 + .quad 0x3fdc149ff0000000 + .quad 0x3fdc3dd7a0000000 + .quad 0x3fdc66f4e0000000 + .quad 0x3fdc8ff7c0000000 + .quad 0x3fdcb8e070000000 + .quad 0x3fdce1af00000000 + .quad 0x3fdd0a63a0000000 + .quad 0x3fdd32fe70000000 + .quad 0x3fdd5b7f90000000 + .quad 0x3fdd83e720000000 + .quad 0x3fddac3530000000 + .quad 0x3fddd46a00000000 + .quad 0x3fddfc8590000000 + .quad 0x3fde248810000000 + .quad 0x3fde4c71a0000000 + .quad 0x3fde744260000000 + .quad 0x3fde9bfa60000000 + .quad 0x3fdec399d0000000 + .quad 0x3fdeeb20c0000000 + .quad 0x3fdf128f50000000 + .quad 0x3fdf39e5b0000000 + .quad 0x3fdf6123f0000000 + .quad 0x3fdf884a30000000 + .quad 0x3fdfaf5880000000 + .quad 0x3fdfd64f20000000 + .quad 0x3fdffd2e00000000 + .quad 0x3fe011fab0000000 + .quad 0x3fe02552a0000000 + .quad 0x3fe0389ee0000000 + .quad 0x3fe04bdf90000000 + .quad 0x3fe05f14b0000000 + .quad 0x3fe0723e50000000 + .quad 0x3fe0855c80000000 + .quad 0x3fe0986f40000000 + .quad 0x3fe0ab76b0000000 + .quad 0x3fe0be72e0000000 + .quad 0x3fe0d163c0000000 + .quad 0x3fe0e44980000000 + .quad 0x3fe0f72410000000 + .quad 0x3fe109f390000000 + .quad 0x3fe11cb810000000 + .quad 0x3fe12f7190000000 + .quad 0x3fe1422020000000 + .quad 0x3fe154c3d0000000 + .quad 0x3fe1675ca0000000 + .quad 0x3fe179eab0000000 + .quad 0x3fe18c6e00000000 + .quad 0x3fe19ee6b0000000 + .quad 0x3fe1b154b0000000 + .quad 0x3fe1c3b810000000 + .quad 0x3fe1d610f0000000 + .quad 0x3fe1e85f50000000 + .quad 0x3fe1faa340000000 + .quad 0x3fe20cdcd0000000 + .quad 0x3fe21f0bf0000000 + .quad 0x3fe23130d0000000 + .quad 0x3fe2434b60000000 + .quad 0x3fe2555bc0000000 + .quad 0x3fe2676200000000 + .quad 0x3fe2795e10000000 + .quad 0x3fe28b5000000000 + .quad 0x3fe29d37f0000000 + .quad 0x3fe2af15f0000000 + .quad 0x3fe2c0e9e0000000 + .quad 0x3fe2d2b400000000 + .quad 0x3fe2e47430000000 + .quad 0x3fe2f62a90000000 + .quad 0x3fe307d730000000 + .quad 0x3fe3197a00000000 + .quad 0x3fe32b1330000000 + .quad 0x3fe33ca2b0000000 + .quad 0x3fe34e2890000000 + .quad 0x3fe35fa4e0000000 + .quad 0x3fe37117b0000000 + .quad 0x3fe38280f0000000 + .quad 0x3fe393e0d0000000 + .quad 0x3fe3a53730000000 + .quad 0x3fe3b68440000000 + .quad 0x3fe3c7c7f0000000 + .quad 0x3fe3d90260000000 + .quad 0x3fe3ea3390000000 + .quad 0x3fe3fb5b80000000 + .quad 0x3fe40c7a40000000 + .quad 0x3fe41d8fe0000000 + .quad 0x3fe42e9c60000000 + .quad 0x3fe43f9fe0000000 + .quad 0x3fe4509a50000000 + .quad 0x3fe4618bc0000000 + .quad 0x3fe4727430000000 + .quad 0x3fe48353d0000000 + .quad 
0x3fe4942a80000000 + .quad 0x3fe4a4f850000000 + .quad 0x3fe4b5bd60000000 + .quad 0x3fe4c679a0000000 + .quad 0x3fe4d72d30000000 + .quad 0x3fe4e7d810000000 + .quad 0x3fe4f87a30000000 + .quad 0x3fe50913c0000000 + .quad 0x3fe519a4c0000000 + .quad 0x3fe52a2d20000000 + .quad 0x3fe53aad00000000 + .quad 0x3fe54b2460000000 + .quad 0x3fe55b9350000000 + .quad 0x3fe56bf9d0000000 + .quad 0x3fe57c57f0000000 + .quad 0x3fe58cadb0000000 + .quad 0x3fe59cfb20000000 + .quad 0x3fe5ad4040000000 + .quad 0x3fe5bd7d30000000 + .quad 0x3fe5cdb1d0000000 + .quad 0x3fe5ddde50000000 + .quad 0x3fe5ee02a0000000 + .quad 0x3fe5fe1ed0000000 + .quad 0x3fe60e32f0000000 + .quad 0x3fe61e3ef0000000 + .quad 0x3fe62e42e0000000 + .quad 0x0000000000000000 + +.align 16 +.L__log_256_tail: + .quad 0x0000000000000000 + .quad 0x3db5885e0250435a + .quad 0x3de620cf11f86ed2 + .quad 0x3dff0214edba4a25 + .quad 0x3dbf807c79f3db4e + .quad 0x3dea352ba779a52b + .quad 0x3dff56c46aa49fd5 + .quad 0x3dfebe465fef5196 + .quad 0x3e0cf0660099f1f8 + .quad 0x3e1247b2ff85945d + .quad 0x3e13fd7abf5202b6 + .quad 0x3e1f91c9a918d51e + .quad 0x3e08cb73f118d3ca + .quad 0x3e1d91c7d6fad074 + .quad 0x3de1971bec28d14c + .quad 0x3e15b616a423c78a + .quad 0x3da162a6617cc971 + .quad 0x3e166391c4c06d29 + .quad 0x3e2d46f5c1d0c4b8 + .quad 0x3e2e14282df1f6d3 + .quad 0x3e186f47424a660d + .quad 0x3e2d4c8de077753e + .quad 0x3e2e0c307ed24f1c + .quad 0x3e226ea18763bdd3 + .quad 0x3e25cad69737c933 + .quad 0x3e2af62599088901 + .quad 0x3e18c66c83d6b2d0 + .quad 0x3e1880ceb36fb30f + .quad 0x3e2495aac6ca17a4 + .quad 0x3e2761db4210878c + .quad 0x3e2eb78e862bac2f + .quad 0x3e19b2cd75790dd9 + .quad 0x3e2c55e5cbd3d50f + .quad 0x3db162a6617cc971 + .quad 0x3dfdbeabaaa2e519 + .quad 0x3e1652cb7150c647 + .quad 0x3e39a11cb2cd2ee2 + .quad 0x3e219d0ab1a28813 + .quad 0x3e24bd9e80a41811 + .quad 0x3e3214b596faa3df + .quad 0x3e303fea46980bb8 + .quad 0x3e31c8ffa5fd28c7 + .quad 0x3dce8f743bcd96c5 + .quad 0x3dfd98c5395315c6 + .quad 0x3e3996fa3ccfa7b2 + .quad 0x3e1cd2af2ad13037 + .quad 0x3e1d0da1bd17200e + .quad 0x3e3330410ba68b75 + .quad 0x3df4f27a790e7c41 + .quad 0x3e13956a86f6ff1b + .quad 0x3e2c6748723551d9 + .quad 0x3e2500de9326cdfc + .quad 0x3e1086c848df1b59 + .quad 0x3e04357ead6836ff + .quad 0x3e24832442408024 + .quad 0x3e3d10da8154b13d + .quad 0x3e39e8ad68ec8260 + .quad 0x3e3cfbf706abaf18 + .quad 0x3e3fc56ac6326e23 + .quad 0x3e39105e3185cf21 + .quad 0x3e3d017fe5b19cc0 + .quad 0x3e3d1f6b48dd13fe + .quad 0x3e20b63358a7e73a + .quad 0x3e263063028c211c + .quad 0x3e2e6a6886b09760 + .quad 0x3e3c138bb891cd03 + .quad 0x3e369f7722b7221a + .quad 0x3df57d8fac1a628c + .quad 0x3e3c55e5cbd3d50f + .quad 0x3e1552d2ff48fe2e + .quad 0x3e37b8b26ca431bc + .quad 0x3e292decdc1c5f6d + .quad 0x3e3abc7c551aaa8c + .quad 0x3e36b540731a354b + .quad 0x3e32d341036b89ef + .quad 0x3e4f9ab21a3a2e0f + .quad 0x3e239c871afb9fbd + .quad 0x3e3e6add2c81f640 + .quad 0x3e435c95aa313f41 + .quad 0x3e249d4582f6cc53 + .quad 0x3e47574c1c07398f + .quad 0x3e4ba846dece9e8d + .quad 0x3e16999fafbc68e7 + .quad 0x3e4c9145e51b0103 + .quad 0x3e479ef2cb44850a + .quad 0x3e0beec73de11275 + .quad 0x3e2ef4351af5a498 + .quad 0x3e45713a493b4a50 + .quad 0x3e45c23a61385992 + .quad 0x3e42a88309f57299 + .quad 0x3e4530faa9ac8ace + .quad 0x3e25fec2d792a758 + .quad 0x3e35a517a71cbcd7 + .quad 0x3e3707dc3e1cd9a3 + .quad 0x3e3a1a9f8ef43049 + .quad 0x3e4409d0276b3674 + .quad 0x3e20e2f613e85bd9 + .quad 0x3df0027433001e5f + .quad 0x3e35dde2836d3265 + .quad 0x3e2300134d7aaf04 + .quad 0x3e3cb7e0b42724f5 + .quad 0x3e2d6e93167e6308 + .quad 0x3e3d1569b1526adb + .quad 
0x3e0e99fc338a1a41 + .quad 0x3e4eb01394a11b1c + .quad 0x3e04f27a790e7c41 + .quad 0x3e25ce3ca97b7af9 + .quad 0x3e281f0f940ed857 + .quad 0x3e4d36295d88857c + .quad 0x3e21aca1ec4af526 + .quad 0x3e445743c7182726 + .quad 0x3e23c491aead337e + .quad 0x3e3aef401a738931 + .quad 0x3e21cede76092a29 + .quad 0x3e4fba8f44f82bb4 + .quad 0x3e446f5f7f3c3e1a + .quad 0x3e47055f86c9674b + .quad 0x3e4b41a92b6b6e1a + .quad 0x3e443d162e927628 + .quad 0x3e4466174013f9b1 + .quad 0x3e3b05096ad69c62 + .quad 0x3e40b169150faa58 + .quad 0x3e3cd98b1df85da7 + .quad 0x3e468b507b0f8fa8 + .quad 0x3e48422df57499ba + .quad 0x3e11351586970274 + .quad 0x3e117e08acba92ee + .quad 0x3e26e04314dd0229 + .quad 0x3e497f3097e56d1a + .quad 0x3e3356e655901286 + .quad 0x3e0cb761457f94d6 + .quad 0x3e39af67a85a9dac + .quad 0x3e453410931a909f + .quad 0x3e22c587206058f5 + .quad 0x3e223bc358899c22 + .quad 0x3e4d7bf8b6d223cb + .quad 0x3e47991ec5197ddb + .quad 0x3e4a79e6bb3a9219 + .quad 0x3e3a4c43ed663ec5 + .quad 0x3e461b5a1484f438 + .quad 0x3e4b4e36f7ef0c3a + .quad 0x3e115f026acd0d1b + .quad 0x3e3f36b535cecf05 + .quad 0x3e2ffb7fbf3eb5c6 + .quad 0x3e3e6a6886b09760 + .quad 0x3e3135eb27f5bbc3 + .quad 0x3e470be7d6f6fa57 + .quad 0x3e4ce43cc84ab338 + .quad 0x3e4c01d7aac3bd91 + .quad 0x3e45c58d07961060 + .quad 0x3e3628bcf941456e + .quad 0x3e4c58b2a8461cd2 + .quad 0x3e33071282fb989a + .quad 0x3e420dab6a80f09c + .quad 0x3e44f8d84c397b1e + .quad 0x3e40d0ee08599e48 + .quad 0x3e1d68787e37da36 + .quad 0x3e366187d591bafc + .quad 0x3e22346600bae772 + .quad 0x3e390377d0d61b8e + .quad 0x3e4f5e0dd966b907 + .quad 0x3e49023cb79a00e2 + .quad 0x3e44e05158c28ad8 + .quad 0x3e3bfa7b08b18ae4 + .quad 0x3e4ef1e63db35f67 + .quad 0x3e0ec2ae39493d4f + .quad 0x3e40afe930ab2fa0 + .quad 0x3e225ff8a1810dd4 + .quad 0x3e469743fb1a71a5 + .quad 0x3e5f9cc676785571 + .quad 0x3e5b524da4cbf982 + .quad 0x3e5a4c8b381535b8 + .quad 0x3e5839be809caf2c + .quad 0x3e50968a1cb82c13 + .quad 0x3e5eae6a41723fb5 + .quad 0x3e5d9c29a380a4db + .quad 0x3e4094aa0ada625e + .quad 0x3e5973ad6fc108ca + .quad 0x3e4747322fdbab97 + .quad 0x3e593692fa9d4221 + .quad 0x3e5c5a992dfbc7d9 + .quad 0x3e4e1f33e102387a + .quad 0x3e464fbef14c048c + .quad 0x3e4490f513ca5e3b + .quad 0x3e37a6af4d4c799d + .quad 0x3e57574c1c07398f + .quad 0x3e57b133417f8c1c + .quad 0x3e5feb9e0c176514 + .quad 0x3e419f25bb3172f7 + .quad 0x3e45f68a7bbfb852 + .quad 0x3e5ee278497929f1 + .quad 0x3e5ccee006109d58 + .quad 0x3e5ce081a07bd8b3 + .quad 0x3e570e12981817b8 + .quad 0x3e292ab6d93503d0 + .quad 0x3e58cb7dd7c3b61e + .quad 0x3e4efafd0a0b78da + .quad 0x3e5e907267c4288e + .quad 0x3e5d31ef96780875 + .quad 0x3e23430dfcd2ad50 + .quad 0x3e344d88d75bc1f9 + .quad 0x3e5bec0f055e04fc + .quad 0x3e5d85611590b9ad + .quad 0x3df320568e583229 + .quad 0x3e5a891d1772f538 + .quad 0x3e22edc9dabba74d + .quad 0x3e4b9009a1015086 + .quad 0x3e52a12a8c5b1a19 + .quad 0x3e3a7885f0fdac85 + .quad 0x3e5f4ffcd43ac691 + .quad 0x3e52243ae2640aad + .quad 0x3e546513299035d3 + .quad 0x3e5b39c3a62dd725 + .quad 0x3e5ba6dd40049f51 + .quad 0x3e451d1ed7177409 + .quad 0x3e5cb0f2fd7f5216 + .quad 0x3e3ab150cd4e2213 + .quad 0x3e5cfd7bf3193844 + .quad 0x3e53fff8455f1dbd + .quad 0x3e5fee640b905fc9 + .quad 0x3e54e2adf548084c + .quad 0x3e3b597adc1ecdd2 + .quad 0x3e4345bd096d3a75 + .quad 0x3e5101b9d2453c8b + .quad 0x3e508ce55cc8c979 + .quad 0x3e5bbf017e595f71 + .quad 0x3e37ce733bd393dc + .quad 0x3e233bb0a503f8a1 + .quad 0x3e30e2f613e85bd9 + .quad 0x3e5e67555a635b3c + .quad 0x3e2ea88df73d5e8b + .quad 0x3e3d17e03bda18a8 + .quad 0x3e5b607d76044f7e + .quad 0x3e52adc4e71bc2fc + .quad 
0x3e5f99dc7362d1d9 + .quad 0x3e5473fa008e6a6a + .quad 0x3e2b75bb09cb0985 + .quad 0x3e5ea04dd10b9aba + .quad 0x3e5802d0d6979674 + .quad 0x3e174688ccd99094 + .quad 0x3e496f16abb9df22 + .quad 0x3e46e66df2aa374f + .quad 0x3e4e66525ea4550a + .quad 0x3e42d02f34f20cbd + .quad 0x3e46cfce65047188 + .quad 0x3e39b78c842d58b8 + .quad 0x3e4735e624c24bc9 + .quad 0x3e47eba1f7dd1adf + .quad 0x3e586b3e59f65355 + .quad 0x3e1ce38e637f1b4d + .quad 0x3e58d82ec919edc7 + .quad 0x3e4c52648ddcfa37 + .quad 0x3e52482ceae1ac12 + .quad 0x3e55a312311aba4f + .quad 0x3e411e236329f225 + .quad 0x3e5b48c8cd2f246c + .quad 0x3e6efa39ef35793c + .quad 0x0000000000000000 + +.align 16 +.L__log_F_inv_head: + .quad 0x4000000000000000 + .quad 0x3fffe00000000000 + .quad 0x3fffc00000000000 + .quad 0x3fffa00000000000 + .quad 0x3fff800000000000 + .quad 0x3fff600000000000 + .quad 0x3fff400000000000 + .quad 0x3fff200000000000 + .quad 0x3fff000000000000 + .quad 0x3ffee00000000000 + .quad 0x3ffec00000000000 + .quad 0x3ffea00000000000 + .quad 0x3ffe900000000000 + .quad 0x3ffe700000000000 + .quad 0x3ffe500000000000 + .quad 0x3ffe300000000000 + .quad 0x3ffe100000000000 + .quad 0x3ffe000000000000 + .quad 0x3ffde00000000000 + .quad 0x3ffdc00000000000 + .quad 0x3ffda00000000000 + .quad 0x3ffd900000000000 + .quad 0x3ffd700000000000 + .quad 0x3ffd500000000000 + .quad 0x3ffd400000000000 + .quad 0x3ffd200000000000 + .quad 0x3ffd000000000000 + .quad 0x3ffcf00000000000 + .quad 0x3ffcd00000000000 + .quad 0x3ffcb00000000000 + .quad 0x3ffca00000000000 + .quad 0x3ffc800000000000 + .quad 0x3ffc700000000000 + .quad 0x3ffc500000000000 + .quad 0x3ffc300000000000 + .quad 0x3ffc200000000000 + .quad 0x3ffc000000000000 + .quad 0x3ffbf00000000000 + .quad 0x3ffbd00000000000 + .quad 0x3ffbc00000000000 + .quad 0x3ffba00000000000 + .quad 0x3ffb900000000000 + .quad 0x3ffb700000000000 + .quad 0x3ffb600000000000 + .quad 0x3ffb400000000000 + .quad 0x3ffb300000000000 + .quad 0x3ffb200000000000 + .quad 0x3ffb000000000000 + .quad 0x3ffaf00000000000 + .quad 0x3ffad00000000000 + .quad 0x3ffac00000000000 + .quad 0x3ffaa00000000000 + .quad 0x3ffa900000000000 + .quad 0x3ffa800000000000 + .quad 0x3ffa600000000000 + .quad 0x3ffa500000000000 + .quad 0x3ffa400000000000 + .quad 0x3ffa200000000000 + .quad 0x3ffa100000000000 + .quad 0x3ffa000000000000 + .quad 0x3ff9e00000000000 + .quad 0x3ff9d00000000000 + .quad 0x3ff9c00000000000 + .quad 0x3ff9a00000000000 + .quad 0x3ff9900000000000 + .quad 0x3ff9800000000000 + .quad 0x3ff9700000000000 + .quad 0x3ff9500000000000 + .quad 0x3ff9400000000000 + .quad 0x3ff9300000000000 + .quad 0x3ff9200000000000 + .quad 0x3ff9000000000000 + .quad 0x3ff8f00000000000 + .quad 0x3ff8e00000000000 + .quad 0x3ff8d00000000000 + .quad 0x3ff8b00000000000 + .quad 0x3ff8a00000000000 + .quad 0x3ff8900000000000 + .quad 0x3ff8800000000000 + .quad 0x3ff8700000000000 + .quad 0x3ff8600000000000 + .quad 0x3ff8400000000000 + .quad 0x3ff8300000000000 + .quad 0x3ff8200000000000 + .quad 0x3ff8100000000000 + .quad 0x3ff8000000000000 + .quad 0x3ff7f00000000000 + .quad 0x3ff7e00000000000 + .quad 0x3ff7d00000000000 + .quad 0x3ff7b00000000000 + .quad 0x3ff7a00000000000 + .quad 0x3ff7900000000000 + .quad 0x3ff7800000000000 + .quad 0x3ff7700000000000 + .quad 0x3ff7600000000000 + .quad 0x3ff7500000000000 + .quad 0x3ff7400000000000 + .quad 0x3ff7300000000000 + .quad 0x3ff7200000000000 + .quad 0x3ff7100000000000 + .quad 0x3ff7000000000000 + .quad 0x3ff6f00000000000 + .quad 0x3ff6e00000000000 + .quad 0x3ff6d00000000000 + .quad 0x3ff6c00000000000 + .quad 0x3ff6b00000000000 + .quad 
0x3ff6a00000000000 + .quad 0x3ff6900000000000 + .quad 0x3ff6800000000000 + .quad 0x3ff6700000000000 + .quad 0x3ff6600000000000 + .quad 0x3ff6500000000000 + .quad 0x3ff6400000000000 + .quad 0x3ff6300000000000 + .quad 0x3ff6200000000000 + .quad 0x3ff6100000000000 + .quad 0x3ff6000000000000 + .quad 0x3ff5f00000000000 + .quad 0x3ff5e00000000000 + .quad 0x3ff5d00000000000 + .quad 0x3ff5c00000000000 + .quad 0x3ff5b00000000000 + .quad 0x3ff5a00000000000 + .quad 0x3ff5900000000000 + .quad 0x3ff5800000000000 + .quad 0x3ff5800000000000 + .quad 0x3ff5700000000000 + .quad 0x3ff5600000000000 + .quad 0x3ff5500000000000 + .quad 0x3ff5400000000000 + .quad 0x3ff5300000000000 + .quad 0x3ff5200000000000 + .quad 0x3ff5100000000000 + .quad 0x3ff5000000000000 + .quad 0x3ff5000000000000 + .quad 0x3ff4f00000000000 + .quad 0x3ff4e00000000000 + .quad 0x3ff4d00000000000 + .quad 0x3ff4c00000000000 + .quad 0x3ff4b00000000000 + .quad 0x3ff4a00000000000 + .quad 0x3ff4a00000000000 + .quad 0x3ff4900000000000 + .quad 0x3ff4800000000000 + .quad 0x3ff4700000000000 + .quad 0x3ff4600000000000 + .quad 0x3ff4600000000000 + .quad 0x3ff4500000000000 + .quad 0x3ff4400000000000 + .quad 0x3ff4300000000000 + .quad 0x3ff4200000000000 + .quad 0x3ff4200000000000 + .quad 0x3ff4100000000000 + .quad 0x3ff4000000000000 + .quad 0x3ff3f00000000000 + .quad 0x3ff3e00000000000 + .quad 0x3ff3e00000000000 + .quad 0x3ff3d00000000000 + .quad 0x3ff3c00000000000 + .quad 0x3ff3b00000000000 + .quad 0x3ff3b00000000000 + .quad 0x3ff3a00000000000 + .quad 0x3ff3900000000000 + .quad 0x3ff3800000000000 + .quad 0x3ff3800000000000 + .quad 0x3ff3700000000000 + .quad 0x3ff3600000000000 + .quad 0x3ff3500000000000 + .quad 0x3ff3500000000000 + .quad 0x3ff3400000000000 + .quad 0x3ff3300000000000 + .quad 0x3ff3200000000000 + .quad 0x3ff3200000000000 + .quad 0x3ff3100000000000 + .quad 0x3ff3000000000000 + .quad 0x3ff3000000000000 + .quad 0x3ff2f00000000000 + .quad 0x3ff2e00000000000 + .quad 0x3ff2e00000000000 + .quad 0x3ff2d00000000000 + .quad 0x3ff2c00000000000 + .quad 0x3ff2b00000000000 + .quad 0x3ff2b00000000000 + .quad 0x3ff2a00000000000 + .quad 0x3ff2900000000000 + .quad 0x3ff2900000000000 + .quad 0x3ff2800000000000 + .quad 0x3ff2700000000000 + .quad 0x3ff2700000000000 + .quad 0x3ff2600000000000 + .quad 0x3ff2500000000000 + .quad 0x3ff2500000000000 + .quad 0x3ff2400000000000 + .quad 0x3ff2300000000000 + .quad 0x3ff2300000000000 + .quad 0x3ff2200000000000 + .quad 0x3ff2100000000000 + .quad 0x3ff2100000000000 + .quad 0x3ff2000000000000 + .quad 0x3ff2000000000000 + .quad 0x3ff1f00000000000 + .quad 0x3ff1e00000000000 + .quad 0x3ff1e00000000000 + .quad 0x3ff1d00000000000 + .quad 0x3ff1c00000000000 + .quad 0x3ff1c00000000000 + .quad 0x3ff1b00000000000 + .quad 0x3ff1b00000000000 + .quad 0x3ff1a00000000000 + .quad 0x3ff1900000000000 + .quad 0x3ff1900000000000 + .quad 0x3ff1800000000000 + .quad 0x3ff1800000000000 + .quad 0x3ff1700000000000 + .quad 0x3ff1600000000000 + .quad 0x3ff1600000000000 + .quad 0x3ff1500000000000 + .quad 0x3ff1500000000000 + .quad 0x3ff1400000000000 + .quad 0x3ff1300000000000 + .quad 0x3ff1300000000000 + .quad 0x3ff1200000000000 + .quad 0x3ff1200000000000 + .quad 0x3ff1100000000000 + .quad 0x3ff1100000000000 + .quad 0x3ff1000000000000 + .quad 0x3ff0f00000000000 + .quad 0x3ff0f00000000000 + .quad 0x3ff0e00000000000 + .quad 0x3ff0e00000000000 + .quad 0x3ff0d00000000000 + .quad 0x3ff0d00000000000 + .quad 0x3ff0c00000000000 + .quad 0x3ff0c00000000000 + .quad 0x3ff0b00000000000 + .quad 0x3ff0a00000000000 + .quad 0x3ff0a00000000000 + .quad 
0x3ff0900000000000 + .quad 0x3ff0900000000000 + .quad 0x3ff0800000000000 + .quad 0x3ff0800000000000 + .quad 0x3ff0700000000000 + .quad 0x3ff0700000000000 + .quad 0x3ff0600000000000 + .quad 0x3ff0600000000000 + .quad 0x3ff0500000000000 + .quad 0x3ff0500000000000 + .quad 0x3ff0400000000000 + .quad 0x3ff0400000000000 + .quad 0x3ff0300000000000 + .quad 0x3ff0300000000000 + .quad 0x3ff0200000000000 + .quad 0x3ff0200000000000 + .quad 0x3ff0100000000000 + .quad 0x3ff0100000000000 + .quad 0x3ff0000000000000 + .quad 0x3ff0000000000000 + +.align 16 +.L__log_F_inv_tail: + .quad 0x0000000000000000 + .quad 0x3effe01fe01fe020 + .quad 0x3f1fc07f01fc07f0 + .quad 0x3f31caa01fa11caa + .quad 0x3f3f81f81f81f820 + .quad 0x3f48856506ddaba6 + .quad 0x3f5196792909c560 + .quad 0x3f57d9108c2ad433 + .quad 0x3f5f07c1f07c1f08 + .quad 0x3f638ff08b1c03dd + .quad 0x3f680f6603d980f6 + .quad 0x3f6d00f57403d5d0 + .quad 0x3f331abf0b7672a0 + .quad 0x3f506a965d43919b + .quad 0x3f5ceb240795ceb2 + .quad 0x3f6522f3b834e67f + .quad 0x3f6c3c3c3c3c3c3c + .quad 0x3f3e01e01e01e01e + .quad 0x3f575b8fe21a291c + .quad 0x3f6403b9403b9404 + .quad 0x3f6cc0ed7303b5cc + .quad 0x3f479118f3fc4da2 + .quad 0x3f5ed952e0b0ce46 + .quad 0x3f695900eae56404 + .quad 0x3f3d41d41d41d41d + .quad 0x3f5cb28ff16c69ae + .quad 0x3f696b1edd80e866 + .quad 0x3f4372e225fe30d9 + .quad 0x3f60ad12073615a2 + .quad 0x3f6cdb2c0397cdb3 + .quad 0x3f52cc157b864407 + .quad 0x3f664cb5f7148404 + .quad 0x3f3c71c71c71c71c + .quad 0x3f6129a21a930b84 + .quad 0x3f6f1e0387f1e038 + .quad 0x3f5ad4e4ba80709b + .quad 0x3f6c0e070381c0e0 + .quad 0x3f560fba1a362bb0 + .quad 0x3f6a5713280dee96 + .quad 0x3f53f59620f9ece9 + .quad 0x3f69f22983759f23 + .quad 0x3f5478ac63fc8d5c + .quad 0x3f6ad87bb4671656 + .quad 0x3f578b8efbb8148c + .quad 0x3f6d0369d0369d03 + .quad 0x3f5d212b601b3748 + .quad 0x3f0b2036406c80d9 + .quad 0x3f629663b24547d1 + .quad 0x3f4435e50d79435e + .quad 0x3f67d0ff2920bc03 + .quad 0x3f55c06b15c06b16 + .quad 0x3f6e3a5f0fd7f954 + .quad 0x3f61dec0d4c77b03 + .quad 0x3f473289870ac52e + .quad 0x3f6a034da034da03 + .quad 0x3f5d041da2292856 + .quad 0x3f3a41a41a41a41a + .quad 0x3f68550f8a39409d + .quad 0x3f5b4fe5e92c0686 + .quad 0x3f3a01a01a01a01a + .quad 0x3f691d2a2067b23a + .quad 0x3f5e7c5dada0b4e5 + .quad 0x3f468a7725080ce1 + .quad 0x3f6c49d4aa21b490 + .quad 0x3f63333333333333 + .quad 0x3f54bc363b03fccf + .quad 0x3f2c9f01970e4f81 + .quad 0x3f697617c6ef5b25 + .quad 0x3f6161f9add3c0ca + .quad 0x3f5319fe6cb39806 + .quad 0x3f2f693a1c451ab3 + .quad 0x3f6a9e240321a9e2 + .quad 0x3f63831f3831f383 + .quad 0x3f5949ebc4dcfc1c + .quad 0x3f480c6980c6980c + .quad 0x3f6f9d00c5fe7403 + .quad 0x3f69721ed7e75347 + .quad 0x3f6381ec0313381f + .quad 0x3f5b97c2aec12653 + .quad 0x3f509ef3024ae3ba + .quad 0x3f38618618618618 + .quad 0x3f6e0184f00c2780 + .quad 0x3f692ef5657dba52 + .quad 0x3f64940305494030 + .quad 0x3f60303030303030 + .quad 0x3f58060180601806 + .quad 0x3f5017f405fd017f + .quad 0x3f412a8ad278e8dd + .quad 0x3f17d05f417d05f4 + .quad 0x3f6d67245c02f7d6 + .quad 0x3f6a4411c1d986a9 + .quad 0x3f6754d76c7316df + .quad 0x3f649902f149902f + .quad 0x3f621023358c1a68 + .quad 0x3f5f7390d2a6c406 + .quad 0x3f5b2b0805d5b2b1 + .quad 0x3f5745d1745d1746 + .quad 0x3f53c31507fa32c4 + .quad 0x3f50a1fd1b7af017 + .quad 0x3f4bc36ce3e0453a + .quad 0x3f4702e05c0b8170 + .quad 0x3f4300b79300b793 + .quad 0x3f3f76b4337c6cb1 + .quad 0x3f3a62681c860fb0 + .quad 0x3f36c16c16c16c17 + .quad 0x3f3490aa31a3cfc7 + .quad 0x3f33cd153729043e + .quad 0x3f3473a88d0bfd2e + .quad 0x3f36816816816817 + .quad 0x3f39f36016719f36 + .quad 
0x3f3ec6a5122f9016 + .quad 0x3f427c29da5519cf + .quad 0x3f4642c8590b2164 + .quad 0x3f4ab5c45606f00b + .quad 0x3f4fd3b80b11fd3c + .quad 0x3f52cda0c6ba4eaa + .quad 0x3f56058160581606 + .quad 0x3f5990d0a4b7ef87 + .quad 0x3f5d6ee340579d6f + .quad 0x3f60cf87d9c54a69 + .quad 0x3f6310572620ae4c + .quad 0x3f65798c8ff522a2 + .quad 0x3f680ad602b580ad + .quad 0x3f6ac3e24799546f + .quad 0x3f6da46102b1da46 + .quad 0x3f15805601580560 + .quad 0x3f3ed3c506b39a23 + .quad 0x3f4cbdd3e2970f60 + .quad 0x3f55555555555555 + .quad 0x3f5c979aee0bf805 + .quad 0x3f621291e81fd58e + .quad 0x3f65fead500a9580 + .quad 0x3f6a0fd5c5f02a3a + .quad 0x3f6e45c223898adc + .quad 0x3f35015015015015 + .quad 0x3f4c7b16ea64d422 + .quad 0x3f57829cbc14e5e1 + .quad 0x3f60877db8589720 + .quad 0x3f65710e4b5edcea + .quad 0x3f6a7dbb4d1fc1c8 + .quad 0x3f6fad40a57eb503 + .quad 0x3f43fd6bb00a5140 + .quad 0x3f54e78ecb419ba9 + .quad 0x3f600a44029100a4 + .quad 0x3f65c28f5c28f5c3 + .quad 0x3f6b9c68b2c0cc4a + .quad 0x3f2978feb9f34381 + .quad 0x3f4ecf163bb6500a + .quad 0x3f5be1958b67ebb9 + .quad 0x3f644e6157dc9a3b + .quad 0x3f6acc4baa3f0ddf + .quad 0x3f26a4cbcb2a247b + .quad 0x3f50505050505050 + .quad 0x3f5e0b4439959819 + .quad 0x3f66027f6027f602 + .quad 0x3f6d1e854b5e0db4 + .quad 0x3f4165e7254813e2 + .quad 0x3f576646a9d716ef + .quad 0x3f632b48f757ce88 + .quad 0x3f6ac1b24652a906 + .quad 0x3f33b13b13b13b14 + .quad 0x3f5490e1eb208984 + .quad 0x3f62385830fec66e + .quad 0x3f6a45a6cc111b7e + .quad 0x3f33813813813814 + .quad 0x3f556f472517b708 + .quad 0x3f631be7bc0e8f2a + .quad 0x3f6b9cbf3e55f044 + .quad 0x3f40e7d95bc609a9 + .quad 0x3f59e6b3804d19e7 + .quad 0x3f65c8b6af7963c2 + .quad 0x3f6eb9dad43bf402 + .quad 0x3f4f1a515885fb37 + .quad 0x3f60eeb1d3d76c02 + .quad 0x3f6a320261a32026 + .quad 0x3f3c82ac40260390 + .quad 0x3f5a12f684bda12f + .quad 0x3f669d43fda2962c + .quad 0x3f02e025c04b8097 + .quad 0x3f542804b542804b + .quad 0x3f63f69b02593f6a + .quad 0x3f6df31cb46e21fa + .quad 0x3f5012b404ad012b + .quad 0x3f623925e7820a7f + .quad 0x3f6c8253c8253c82 + .quad 0x3f4b92ddc02526e5 + .quad 0x3f61602511602511 + .quad 0x3f6bf471439c9adf + .quad 0x3f4a85c40939a85c + .quad 0x3f6166f9ac024d16 + .quad 0x3f6c44e10125e227 + .quad 0x3f4cebf48bbd90e5 + .quad 0x3f62492492492492 + .quad 0x3f6d6f2e2ec0b673 + .quad 0x3f5159e26af37c05 + .quad 0x3f64024540245402 + .quad 0x3f6f6f0243f6f024 + .quad 0x3f55e60121579805 + .quad 0x3f668e18cf81b10f + .quad 0x3f32012012012012 + .quad 0x3f5c11f7047dc11f + .quad 0x3f69e878ff70985e + .quad 0x3f4779d9fdc3a219 + .quad 0x3f61eace5c957907 + .quad 0x3f6e0d5b450239e1 + .quad 0x3f548bf073816367 + .quad 0x3f6694808dda5202 + .quad 0x3f37c67f2bae2b21 + .quad 0x3f5ee58469ee5847 + .quad 0x3f6c0233c0233c02 + .quad 0x3f514e02328a7012 + .quad 0x3f6561072057b573 + .quad 0x3f31811811811812 + .quad 0x3f5e28646f5a1060 + .quad 0x3f6c0d1284e6f1d7 + .quad 0x3f523543f0c80459 + .quad 0x3f663cbeea4e1a09 + .quad 0x3f3b9a3fdd5c8cb8 + .quad 0x3f60be1c159a76d2 + .quad 0x3f6e1d1a688e4838 + .quad 0x3f572044d72044d7 + .quad 0x3f691713db81577b + .quad 0x3f4ac73ae9819b50 + .quad 0x3f6460334e904cf6 + .quad 0x3f31111111111111 + .quad 0x3f5feef80441fef0 + .quad 0x3f6de021fde021fe + .quad 0x3f57b7eacc9686a0 + .quad 0x3f69ead7cd391fbc + .quad 0x3f50195609804390 + .quad 0x3f6641511e8d2b32 + .quad 0x3f4222b1acf1ce96 + .quad 0x3f62e29f79b47582 + .quad 0x3f24f0d1682e11cd + .quad 0x3f5f9bb096771e4d + .quad 0x3f6e5ee45dd96ae2 + .quad 0x3f5a0429a0429a04 + .quad 0x3f6bb74d5f06c021 + .quad 0x3f54fce404254fce + .quad 0x3f695766eacbc402 + .quad 0x3f50842108421084 + .quad 
0x3f673e5371d5c338 + .quad 0x3f4930523fbe3368 + .quad 0x3f656b38f225f6c4 + .quad 0x3f426e978d4fdf3b + .quad 0x3f63dd40e4eb0cc6 + .quad 0x3f397f7d73404146 + .quad 0x3f6293982cc98af1 + .quad 0x3f30410410410410 + .quad 0x3f618d6f048ff7e4 + .quad 0x3f2236a3ebc349de + .quad 0x3f60c9f8ee53d18c + .quad 0x3f10204081020408 + .quad 0x3f60486ca2f46ea6 + .quad 0x3ef0101010101010 + .quad 0x3f60080402010080 + .quad 0x0000000000000000 + +#--------------------- +# exp data +#--------------------- + +.align 16 + +.L__denormal_threshold: .long 0x0fffffc02 # -1022 + .long 0 + .quad 0 + +.L__enable_almost_inf: .quad 0x7fe0000000000000 + .quad 0 + +.L__real_zero: .quad 0x0000000000000000 + .quad 0 + +.L__real_smallest_denormal: .quad 0x0000000000000001 + .quad 0 +.L__denormal_tiny_threshold: .quad 0x0c0874046dfefd9d0 + .quad 0 + +.L__real_p65536: .quad 0x40f0000000000000 # 65536 + .quad 0 +.L__real_m68800: .quad 0x0c0f0cc0000000000 # -68800 + .quad 0 +.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2) + .quad 0 +.L__real_log2_by_64_head: .quad 0x3f862e42f0000000 # log2_by_64_head + .quad 0 +.L__real_log2_by_64_tail: .quad 0x0bdfdf473de6af278 # -log2_by_64_tail + .quad 0 +.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720 + .quad 0 +.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120 + .quad 0 +.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24 + .quad 0 +.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6 + .quad 0 +.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2 + .quad 0 + +.align 16 +.L__two_to_jby64_head_table: + .quad 0x3ff0000000000000 + .quad 0x3ff02c9a30000000 + .quad 0x3ff059b0d0000000 + .quad 0x3ff0874510000000 + .quad 0x3ff0b55860000000 + .quad 0x3ff0e3ec30000000 + .quad 0x3ff11301d0000000 + .quad 0x3ff1429aa0000000 + .quad 0x3ff172b830000000 + .quad 0x3ff1a35be0000000 + .quad 0x3ff1d48730000000 + .quad 0x3ff2063b80000000 + .quad 0x3ff2387a60000000 + .quad 0x3ff26b4560000000 + .quad 0x3ff29e9df0000000 + .quad 0x3ff2d285a0000000 + .quad 0x3ff306fe00000000 + .quad 0x3ff33c08b0000000 + .quad 0x3ff371a730000000 + .quad 0x3ff3a7db30000000 + .quad 0x3ff3dea640000000 + .quad 0x3ff4160a20000000 + .quad 0x3ff44e0860000000 + .quad 0x3ff486a2b0000000 + .quad 0x3ff4bfdad0000000 + .quad 0x3ff4f9b270000000 + .quad 0x3ff5342b50000000 + .quad 0x3ff56f4730000000 + .quad 0x3ff5ab07d0000000 + .quad 0x3ff5e76f10000000 + .quad 0x3ff6247eb0000000 + .quad 0x3ff6623880000000 + .quad 0x3ff6a09e60000000 + .quad 0x3ff6dfb230000000 + .quad 0x3ff71f75e0000000 + .quad 0x3ff75feb50000000 + .quad 0x3ff7a11470000000 + .quad 0x3ff7e2f330000000 + .quad 0x3ff8258990000000 + .quad 0x3ff868d990000000 + .quad 0x3ff8ace540000000 + .quad 0x3ff8f1ae90000000 + .quad 0x3ff93737b0000000 + .quad 0x3ff97d8290000000 + .quad 0x3ff9c49180000000 + .quad 0x3ffa0c6670000000 + .quad 0x3ffa5503b0000000 + .quad 0x3ffa9e6b50000000 + .quad 0x3ffae89f90000000 + .quad 0x3ffb33a2b0000000 + .quad 0x3ffb7f76f0000000 + .quad 0x3ffbcc1e90000000 + .quad 0x3ffc199bd0000000 + .quad 0x3ffc67f120000000 + .quad 0x3ffcb720d0000000 + .quad 0x3ffd072d40000000 + .quad 0x3ffd5818d0000000 + .quad 0x3ffda9e600000000 + .quad 0x3ffdfc9730000000 + .quad 0x3ffe502ee0000000 + .quad 0x3ffea4afa0000000 + .quad 0x3ffefa1be0000000 + .quad 0x3fff507650000000 + .quad 0x3fffa7c180000000 + +.align 16 +.L__two_to_jby64_tail_table: + .quad 0x0000000000000000 + .quad 0x3e6cef00c1dcdef9 + .quad 0x3e48ac2ba1d73e2a + .quad 0x3e60eb37901186be + .quad 0x3e69f3121ec53172 + .quad 0x3e469e8d10103a17 + .quad 0x3df25b50a4ebbf1a + .quad 0x3e6d525bbf668203 + .quad 0x3e68faa2f5b9bef9 + .quad 
0x3e66df96ea796d31 + .quad 0x3e368b9aa7805b80 + .quad 0x3e60c519ac771dd6 + .quad 0x3e6ceac470cd83f5 + .quad 0x3e5789f37495e99c + .quad 0x3e547f7b84b09745 + .quad 0x3e5b900c2d002475 + .quad 0x3e64636e2a5bd1ab + .quad 0x3e4320b7fa64e430 + .quad 0x3e5ceaa72a9c5154 + .quad 0x3e53967fdba86f24 + .quad 0x3e682468446b6824 + .quad 0x3e3f72e29f84325b + .quad 0x3e18624b40c4dbd0 + .quad 0x3e5704f3404f068e + .quad 0x3e54d8a89c750e5e + .quad 0x3e5a74b29ab4cf62 + .quad 0x3e5a753e077c2a0f + .quad 0x3e5ad49f699bb2c0 + .quad 0x3e6a90a852b19260 + .quad 0x3e56b48521ba6f93 + .quad 0x3e0d2ac258f87d03 + .quad 0x3e42a91124893ecf + .quad 0x3e59fcef32422cbe + .quad 0x3e68ca345de441c5 + .quad 0x3e61d8bee7ba46e1 + .quad 0x3e59099f22fdba6a + .quad 0x3e4f580c36bea881 + .quad 0x3e5b3d398841740a + .quad 0x3e62999c25159f11 + .quad 0x3e668925d901c83b + .quad 0x3e415506dadd3e2a + .quad 0x3e622aee6c57304e + .quad 0x3e29b8bc9e8a0387 + .quad 0x3e6fbc9c9f173d24 + .quad 0x3e451f8480e3e235 + .quad 0x3e66bbcac96535b5 + .quad 0x3e41f12ae45a1224 + .quad 0x3e55e7f6fd0fac90 + .quad 0x3e62b5a75abd0e69 + .quad 0x3e609e2bf5ed7fa1 + .quad 0x3e47daf237553d84 + .quad 0x3e12f074891ee83d + .quad 0x3e6b0aa538444196 + .quad 0x3e6cafa29694426f + .quad 0x3e69df20d22a0797 + .quad 0x3e640f12f71a1e45 + .quad 0x3e69f7490e4bb40b + .quad 0x3e4ed9942b84600d + .quad 0x3e4bdcdaf5cb4656 + .quad 0x3e5e2cffd89cf44c + .quad 0x3e452486cc2c7b9d + .quad 0x3e6cc2b44eee3fa4 + .quad 0x3e66dc8a80ce9f09 + .quad 0x3e39e90d82e90a7e + + +#endif
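
The constants above are the working set for the exponential half of pow: v = y*ln(x) is split as v = n*ln(2)/64 + r, where n selects a 2^(j/64) head/tail table entry (j = n mod 64) plus a power-of-two scale m, and e^r - 1 is evaluated from the stored 1/2 ... 1/720 Taylor coefficients. The C model below is an orientation sketch only, under those assumptions: it recomputes 2^(j/64) with exp2 instead of reading .L__two_to_jby64_head_table/.L__two_to_jby64_tail_table, does the argument reduction in one step rather than with the head/tail split of ln(2)/64, and omits the overflow, underflow, and denormal paths the assembly dispatches above.

    #include <math.h>

    /* Sketch of the table-driven reconstruction exp(v) = 2^m * 2^(j/64) * e^r.
       Not the shipped code; see the caveats in the text above. */
    static double exp_by_table_sketch(double v)
    {
        long n = lrint(v * (64.0 / M_LN2));  /* .L__real_64_by_log2; cvtpd2dq rounds to nearest */
        long j = n & 0x3f;                   /* index into the 2^(j/64) tables */
        long m = n >> 6;                     /* leftover power of two */

        /* r = v - n*ln(2)/64; the assembly splits ln(2)/64 into
           log2_by_64_head plus a tail correction so this reduction stays exact */
        double r = v - (double)n * (M_LN2 / 64.0);

        /* e^r - 1 from the 1/2 .. 1/720 coefficients stored above */
        double q = r + r * r * (1.0/2 + r * (1.0/6 + r * (1.0/24
                 + r * (1.0/120 + r * (1.0/720)))));

        double z = exp2((double)j / 64.0);   /* head[j] + tail[j] in the real code */
        return ldexp(z + z * q, (int)m);     /* 2^m * 2^(j/64) * (1 + q) */
    }
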
diff --git a/src/gas/powf.S b/src/gas/powf.S new file mode 100644 index 0000000..96eefd2 --- /dev/null +++ b/src/gas/powf.S
@@ -0,0 +1,1040 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# powf.S +# +# An implementation of the powf libm function. +# +# Prototype: +# +# float powf(float x, float y); +# + +# +# Algorithm: +# x^y = e^(y*ln(x)) +# +# Look in exp, log for the respective algorithms +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(powf) +#define fname_special _powf_special@PLT + + +# local variable storage offsets +.equ save_x, 0x0 +.equ save_y, 0x10 +.equ p_temp_exp, 0x20 +.equ negate_result, 0x30 +.equ save_ax, 0x40 +.equ y_head, 0x50 +.equ p_temp_log, 0x60 +.equ stack_size, 0x78 + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + sub $stack_size, %rsp + + movss %xmm0, save_x(%rsp) + movss %xmm1, save_y(%rsp) + + mov save_x(%rsp), %edx + mov save_y(%rsp), %r8d + + mov .L__f32_exp_mant_mask(%rip), %r10d + and %r8d, %r10d + jz .L__y_is_zero + + cmp .L__f32_pos_one(%rip), %r8d + je .L__y_is_one + + mov .L__f32_sign_mask(%rip), %r9d + and %edx, %r9d + cmp .L__f32_sign_mask(%rip), %r9d + mov .L__f32_pos_zero(%rip), %eax + mov %eax, negate_result(%rsp) + je .L__x_is_neg + + cmp .L__f32_pos_one(%rip), %edx + je .L__x_is_pos_one + + cmp .L__f32_pos_zero(%rip), %edx + je .L__x_is_zero + + mov .L__f32_exp_mask(%rip), %r9d + and %edx, %r9d + cmp .L__f32_exp_mask(%rip), %r9d + je .L__x_is_inf_or_nan + + mov .L__f32_exp_mask(%rip), %r10d + and %r8d, %r10d + cmp .L__f32_ay_max_bound(%rip), %r10d + jg .L__ay_is_very_large + + mov .L__f32_exp_mask(%rip), %r10d + and %r8d, %r10d + cmp .L__f32_ay_min_bound(%rip), %r10d + jl .L__ay_is_very_small + + # ----------------------------- + # compute log(x) here + # ----------------------------- +.L__log_x: + + movss save_y(%rsp), %xmm7 + cvtss2sd %xmm0, %xmm0 + cvtss2sd %xmm7, %xmm7 + movsd %xmm7, save_y(%rsp) + + # compute exponent part + xor %r8, %r8 + movdqa %xmm0, %xmm3 + psrlq $52, %xmm3 + movd %xmm0, %r8 + psubq .L__mask_1023(%rip), %xmm3 + movdqa %xmm0, %xmm2 + cvtdq2pd %xmm3, %xmm6 # xexp + pand .L__real_mant(%rip), %xmm2 + + # compute index into the log tables + mov %r8, %r9 + and .L__mask_mant_all7(%rip), %r8 + and .L__mask_mant8(%rip), %r9 + shl %r9 + add %r9, %r8 + mov %r8, p_temp_log(%rsp) + + # F, Y + movsd p_temp_log(%rsp), %xmm1 + shr $45, %r8 + por .L__real_half(%rip), %xmm2 + por .L__real_half(%rip), %xmm1 + lea .L__log_F_inv(%rip), %r9 + + # f = F - Y, r = f * inv + subsd %xmm2, %xmm1 + mulsd (%r9,%r8,8), %xmm1 + movsd %xmm1, %xmm2 + + lea .L__log_128_table(%rip), %r9 + movsd .L__real_log2(%rip), %xmm5 + movsd (%r9,%r8,8), %xmm0 + + # poly + mulsd %xmm2, %xmm1 + movsd .L__real_1_over_4(%rip), %xmm4 + movsd .L__real_1_over_2(%rip), %xmm3 + mulsd %xmm2, %xmm4 + mulsd %xmm2, %xmm3 + mulsd %xmm2, %xmm1 + addsd 
.L__real_1_over_3(%rip), %xmm4 + addsd .L__real_1_over_1(%rip), %xmm3 + mulsd %xmm1, %xmm4 + mulsd %xmm2, %xmm3 + addsd %xmm4, %xmm3 + + mulsd %xmm6, %xmm5 + subsd %xmm3, %xmm0 + addsd %xmm5, %xmm0 + + movsd save_y(%rsp), %xmm7 + mulsd %xmm7, %xmm0 + + # v = y * ln(x) + # xmm0 - v + + # ----------------------------- + # compute exp( y * ln(x) ) here + # ----------------------------- + + # x * (32/ln(2)) + movsd .L__real_32_by_log2(%rip), %xmm7 + movsd %xmm0, p_temp_exp(%rsp) + mulsd %xmm0, %xmm7 + mov p_temp_exp(%rsp), %rdx + + # v < 128*ln(2), ( v * (32/ln(2)) ) < 32*128 + # v >= -150*ln(2), ( v * (32/ln(2)) ) >= 32*(-150) + comisd .L__real_p4096(%rip), %xmm7 + jae .L__process_result_inf + + comisd .L__real_m4768(%rip), %xmm7 + jb .L__process_result_zero + + # n = int( v * (32/ln(2)) ) + cvtpd2dq %xmm7, %xmm4 + lea .L__two_to_jby32_table(%rip), %r10 + cvtdq2pd %xmm4, %xmm1 + + # r = x - n * ln(2)/32 + movsd .L__real_log2_by_32(%rip), %xmm2 + mulsd %xmm1, %xmm2 + movd %xmm4, %ecx + mov $0x1f, %rax + and %ecx, %eax + subsd %xmm2, %xmm0 + movsd %xmm0, %xmm1 + + # m = (n - j) / 32 + sub %eax, %ecx + sar $5, %ecx + + # q + mulsd %xmm0, %xmm1 + movsd .L__real_1_by_24(%rip), %xmm4 + movsd .L__real_1_by_2(%rip), %xmm3 + mulsd %xmm0, %xmm4 + mulsd %xmm0, %xmm3 + mulsd %xmm0, %xmm1 + addsd .L__real_1_by_6(%rip), %xmm4 + addsd .L__real_1_by_1(%rip), %xmm3 + mulsd %xmm1, %xmm4 + mulsd %xmm0, %xmm3 + addsd %xmm4, %xmm3 + movsd %xmm3, %xmm0 + + add $1023, %rcx + shl $52, %rcx + + # (f)*(1+q) + movsd (%r10,%rax,8), %xmm1 + mulsd %xmm1, %xmm0 + addsd %xmm1, %xmm0 + + mov %rcx, p_temp_exp(%rsp) + mulsd p_temp_exp(%rsp), %xmm0 + cvtsd2ss %xmm0, %xmm0 + orps negate_result(%rsp), %xmm0 + +.L__final_check: + add $stack_size, %rsp + ret + +.p2align 4,,15 +.L__process_result_zero: + mov .L__f32_real_zero(%rip), %r11d + or negate_result(%rsp), %r11d + jmp .L__z_is_zero_or_inf + +.p2align 4,,15 +.L__process_result_inf: + mov .L__f32_real_inf(%rip), %r11d + or negate_result(%rsp), %r11d + jmp .L__z_is_zero_or_inf + + +.p2align 4,,15 +.L__x_is_neg: + + mov .L__f32_exp_mask(%rip), %r10d + and %r8d, %r10d + cmp .L__f32_ay_max_bound(%rip), %r10d + jg .L__ay_is_very_large + + # determine if y is an integer + mov .L__f32_exp_mant_mask(%rip), %r10d + and %r8d, %r10d + mov %r10d, %r11d + mov .L__f32_exp_shift(%rip), %ecx + shr %cl, %r10d + sub .L__f32_exp_bias(%rip), %r10d + js .L__x_is_neg_y_is_not_int + + mov .L__f32_exp_mant_mask(%rip), %eax + and %edx, %eax + mov %eax, save_ax(%rsp) + + cmp .L__yexp_24(%rip), %r10d + mov %r10d, %ecx + jg .L__continue_after_y_int_check + + mov .L__f32_mant_full(%rip), %r9d + shr %cl, %r9d + and %r11d, %r9d + jnz .L__x_is_neg_y_is_not_int + + mov .L__f32_1_before_mant(%rip), %r9d + shr %cl, %r9d + and %r11d, %r9d + jz .L__continue_after_y_int_check + + mov .L__f32_sign_mask(%rip), %eax + mov %eax, negate_result(%rsp) + +.L__continue_after_y_int_check: + + cmp .L__f32_neg_zero(%rip), %edx + je .L__x_is_zero + + cmp .L__f32_neg_one(%rip), %edx + je .L__x_is_neg_one + + mov .L__f32_exp_mask(%rip), %r9d + and %edx, %r9d + cmp .L__f32_exp_mask(%rip), %r9d + je .L__x_is_inf_or_nan + + movss save_ax(%rsp), %xmm0 + jmp .L__log_x + +.p2align 4,,15 +.L__x_is_pos_one: + xor %eax, %eax + mov .L__f32_exp_mask(%rip), %r10d + and %r8d, %r10d + cmp .L__f32_exp_mask(%rip), %r10d + cmove %r8d, %eax + mov .L__f32_mant_mask(%rip), %r10d + and %eax, %r10d + jz .L__final_check + + mov .L__f32_qnan_set(%rip), %r10d + and %r8d, %r10d + jnz .L__final_check + + movss save_x(%rsp), %xmm0 + movss save_y(%rsp), 
%xmm1 + movss .L__f32_pos_one(%rip), %xmm2 + mov .L__flag_x_one_y_snan(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__y_is_zero: + + xor %eax, %eax + mov .L__f32_exp_mask(%rip), %r9d + mov .L__f32_pos_one(%rip), %r11d + and %edx, %r9d + cmp .L__f32_exp_mask(%rip), %r9d + cmove %edx, %eax + mov .L__f32_mant_mask(%rip), %r9d + and %eax, %r9d + jnz .L__x_is_nan + + movss .L__f32_pos_one(%rip), %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__y_is_one: + xor %eax, %eax + mov %edx, %r11d + mov .L__f32_exp_mask(%rip), %r9d + or .L__f32_qnan_set(%rip), %r11d + and %edx, %r9d + cmp .L__f32_exp_mask(%rip), %r9d + cmove %edx, %eax + mov .L__f32_mant_mask(%rip), %r9d + and %eax, %r9d + jnz .L__x_is_nan + + movd %edx, %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_neg_one: + mov .L__f32_pos_one(%rip), %edx + or negate_result(%rsp), %edx + xor %eax, %eax + mov %r8d, %r11d + mov .L__f32_exp_mask(%rip), %r10d + or .L__f32_qnan_set(%rip), %r11d + and %r8d, %r10d + cmp .L__f32_exp_mask(%rip), %r10d + cmove %r8d, %eax + mov .L__f32_mant_mask(%rip), %r10d + and %eax, %r10d + jnz .L__y_is_nan + + movd %edx, %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_neg_y_is_not_int: + mov .L__f32_exp_mask(%rip), %r9d + and %edx, %r9d + cmp .L__f32_exp_mask(%rip), %r9d + je .L__x_is_inf_or_nan + + cmp .L__f32_neg_zero(%rip), %edx + je .L__x_is_zero + + movss save_x(%rsp), %xmm0 + movss save_y(%rsp), %xmm1 + movss .L__f32_qnan(%rip), %xmm2 + mov .L__flag_x_neg_y_notint(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__ay_is_very_large: + mov .L__f32_exp_mask(%rip), %r9d + and %edx, %r9d + cmp .L__f32_exp_mask(%rip), %r9d + je .L__x_is_inf_or_nan + + mov .L__f32_exp_mant_mask(%rip), %r9d + and %edx, %r9d + jz .L__x_is_zero + + cmp .L__f32_neg_one(%rip), %edx + je .L__x_is_neg_one + + mov %edx, %r9d + and .L__f32_exp_mant_mask(%rip), %r9d + cmp .L__f32_pos_one(%rip), %r9d + jl .L__ax_lt1_y_is_large_or_inf_or_nan + + jmp .L__ax_gt1_y_is_large_or_inf_or_nan + +.p2align 4,,15 +.L__x_is_zero: + mov .L__f32_exp_mask(%rip), %r10d + xor %eax, %eax + and %r8d, %r10d + cmp .L__f32_exp_mask(%rip), %r10d + je .L__x_is_zero_y_is_inf_or_nan + + mov .L__f32_sign_mask(%rip), %r10d + and %r8d, %r10d + cmovnz .L__f32_pos_inf(%rip), %eax + jnz .L__x_is_zero_z_is_inf + + movd %eax, %xmm0 + orps negate_result(%rsp), %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_zero_z_is_inf: + + movss save_x(%rsp), %xmm0 + movss save_y(%rsp), %xmm1 + movd %eax, %xmm2 + orps negate_result(%rsp), %xmm2 + mov .L__flag_x_zero_z_inf(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_zero_y_is_inf_or_nan: + mov %r8d, %r11d + cmp .L__f32_neg_inf(%rip), %r8d + cmove .L__f32_pos_inf(%rip), %eax + je .L__x_is_zero_z_is_inf + + or .L__f32_qnan_set(%rip), %r11d + mov .L__f32_mant_mask(%rip), %r10d + and %r8d, %r10d + jnz .L__y_is_nan + + movd %eax, %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_inf_or_nan: + xor %r11d, %r11d + mov .L__f32_sign_mask(%rip), %r10d + and %r8d, %r10d + cmovz .L__f32_pos_inf(%rip), %r11d + mov %edx, %eax + mov .L__f32_mant_mask(%rip), %r9d + or .L__f32_qnan_set(%rip), %eax + and %edx, %r9d + cmovnz %eax, %r11d + jnz .L__x_is_nan + + xor %eax, %eax + mov %r8d, %r9d + mov .L__f32_exp_mask(%rip), %r10d + or .L__f32_qnan_set(%rip), %r9d + and %r8d, %r10d + cmp .L__f32_exp_mask(%rip), %r10d + cmove %r8d, %eax + mov .L__f32_mant_mask(%rip), %r10d + and %eax, %r10d + cmovnz %r9d, %r11d + jnz .L__y_is_nan + + movd 
%r11d, %xmm0 + orps negate_result(%rsp), %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__ay_is_very_small: + movss .L__f32_pos_one(%rip), %xmm0 + addss %xmm1, %xmm0 + jmp .L__final_check + + +.p2align 4,,15 +.L__ax_lt1_y_is_large_or_inf_or_nan: + xor %r11d, %r11d + mov .L__f32_sign_mask(%rip), %r10d + and %r8d, %r10d + cmovnz .L__f32_pos_inf(%rip), %r11d + jmp .L__adjust_for_nan + +.p2align 4,,15 +.L__ax_gt1_y_is_large_or_inf_or_nan: + xor %r11d, %r11d + mov .L__f32_sign_mask(%rip), %r10d + and %r8d, %r10d + cmovz .L__f32_pos_inf(%rip), %r11d + +.p2align 4,,15 +.L__adjust_for_nan: + + xor %eax, %eax + mov %r8d, %r9d + mov .L__f32_exp_mask(%rip), %r10d + or .L__f32_qnan_set(%rip), %r9d + and %r8d, %r10d + cmp .L__f32_exp_mask(%rip), %r10d + cmove %r8d, %eax + mov .L__f32_mant_mask(%rip), %r10d + and %eax, %r10d + cmovnz %r9d, %r11d + jnz .L__y_is_nan + + test %eax, %eax + jnz .L__y_is_inf + +.p2align 4,,15 +.L__z_is_zero_or_inf: + + mov .L__flag_z_zero(%rip), %edi + test %r11d, %r11d + cmovnz .L__flag_z_inf(%rip), %edi + + movss save_x(%rsp), %xmm0 + movss save_y(%rsp), %xmm1 + movd %r11d, %xmm2 + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__y_is_inf: + + movd %r11d, %xmm0 + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_nan: + + xor %eax, %eax + mov .L__f32_exp_mask(%rip), %r10d + and %r8d, %r10d + cmp .L__f32_exp_mask(%rip), %r10d + cmove %r8d, %eax + mov .L__f32_mant_mask(%rip), %r10d + and %eax, %r10d + jnz .L__x_is_nan_y_is_nan + + mov .L__f32_qnan_set(%rip), %r9d + and %edx, %r9d + movd %r11d, %xmm0 + jnz .L__final_check + + movss save_x(%rsp), %xmm0 + movss save_y(%rsp), %xmm1 + movd %r11d, %xmm2 + mov .L__flag_x_nan(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__y_is_nan: + + mov .L__f32_qnan_set(%rip), %r10d + and %r8d, %r10d + movd %r11d, %xmm0 + jnz .L__final_check + + movss save_x(%rsp), %xmm0 + movss save_y(%rsp), %xmm1 + movd %r11d, %xmm2 + mov .L__flag_y_nan(%rip), %edi + + call fname_special + jmp .L__final_check + +.p2align 4,,15 +.L__x_is_nan_y_is_nan: + + mov .L__f32_qnan_set(%rip), %r9d + and %edx, %r9d + jz .L__continue_xy_nan + + mov .L__f32_qnan_set(%rip), %r10d + and %r8d, %r10d + jz .L__continue_xy_nan + + movd %r11d, %xmm0 + jmp .L__final_check + +.L__continue_xy_nan: + movss save_x(%rsp), %xmm0 + movss save_y(%rsp), %xmm1 + movd %r11d, %xmm2 + mov .L__flag_x_nan_y_nan(%rip), %edi + + call fname_special + jmp .L__final_check + +.data + +.align 16 + +# these codes and the ones in the corresponding .c file have to match +.L__flag_x_one_y_snan: .long 1 +.L__flag_x_zero_z_inf: .long 2 +.L__flag_x_nan: .long 3 +.L__flag_y_nan: .long 4 +.L__flag_x_nan_y_nan: .long 5 +.L__flag_x_neg_y_notint: .long 6 +.L__flag_z_zero: .long 7 +.L__flag_z_denormal: .long 8 +.L__flag_z_inf: .long 9 + +.align 16 + +.L__f32_ay_max_bound: .long 0x4f000000 +.L__f32_ay_min_bound: .long 0x2e800000 +.L__f32_sign_mask: .long 0x80000000 +.L__f32_sign_and_exp_mask: .long 0x0ff800000 +.L__f32_exp_mask: .long 0x7f800000 +.L__f32_neg_inf: .long 0x0ff800000 +.L__f32_pos_inf: .long 0x7f800000 +.L__f32_pos_one: .long 0x3f800000 +.L__f32_pos_zero: .long 0x00000000 +.L__f32_exp_mant_mask: .long 0x7fffffff +.L__f32_mant_mask: .long 0x007fffff + +.L__f32_neg_qnan: .long 0x0ffc00000 +.L__f32_qnan: .long 0x7fc00000 +.L__f32_qnan_set: .long 0x00400000 + +.L__f32_neg_one: .long 0x0bf800000 +.L__f32_neg_zero: .long 0x80000000 + +.L__f32_real_one: .long 0x3f800000 +.L__f32_real_zero: .long 0x00000000 +.L__f32_real_inf: .long 0x7f800000 + +.L__yexp_24: .long 
0x00000018 + +.L__f32_exp_shift: .long 0x00000017 +.L__f32_exp_bias: .long 0x0000007f +.L__f32_mant_full: .long 0x007fffff +.L__f32_1_before_mant: .long 0x00800000 + +.align 16 + +.L__mask_mant_all7: .quad 0x000fe00000000000 +.L__mask_mant8: .quad 0x0000100000000000 + +#--------------------- +# log data +#--------------------- + +.align 16 + +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0000000000000000 +.L__real_inf: .quad 0x7ff0000000000000 # +inf + .quad 0x0000000000000000 +.L__real_nan: .quad 0x7ff8000000000000 # NaN + .quad 0x0000000000000000 +.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000000000000000 +.L__mask_1023: .quad 0x00000000000003ff + .quad 0x0000000000000000 + + +.L__real_log2: .quad 0x3fe62e42fefa39ef + .quad 0x0000000000000000 + +.L__real_two: .quad 0x4000000000000000 # 2 + .quad 0x0000000000000000 + +.L__real_one: .quad 0x3ff0000000000000 # 1 + .quad 0x0000000000000000 + +.L__real_half: .quad 0x3fe0000000000000 # 1/2 + .quad 0x0000000000000000 + +.L__real_1_over_1: .quad 0x3ff0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_2: .quad 0x3fe0000000000000 + .quad 0x0000000000000000 +.L__real_1_over_3: .quad 0x3fd5555555555555 + .quad 0x0000000000000000 +.L__real_1_over_4: .quad 0x3fd0000000000000 + .quad 0x0000000000000000 + + +.align 16 +.L__log_128_table: + .quad 0x0000000000000000 + .quad 0x3f7fe02a6b106789 + .quad 0x3f8fc0a8b0fc03e4 + .quad 0x3f97b91b07d5b11b + .quad 0x3f9f829b0e783300 + .quad 0x3fa39e87b9febd60 + .quad 0x3fa77458f632dcfc + .quad 0x3fab42dd711971bf + .quad 0x3faf0a30c01162a6 + .quad 0x3fb16536eea37ae1 + .quad 0x3fb341d7961bd1d1 + .quad 0x3fb51b073f06183f + .quad 0x3fb6f0d28ae56b4c + .quad 0x3fb8c345d6319b21 + .quad 0x3fba926d3a4ad563 + .quad 0x3fbc5e548f5bc743 + .quad 0x3fbe27076e2af2e6 + .quad 0x3fbfec9131dbeabb + .quad 0x3fc0d77e7cd08e59 + .quad 0x3fc1b72ad52f67a0 + .quad 0x3fc29552f81ff523 + .quad 0x3fc371fc201e8f74 + .quad 0x3fc44d2b6ccb7d1e + .quad 0x3fc526e5e3a1b438 + .quad 0x3fc5ff3070a793d4 + .quad 0x3fc6d60fe719d21d + .quad 0x3fc7ab890210d909 + .quad 0x3fc87fa06520c911 + .quad 0x3fc9525a9cf456b4 + .quad 0x3fca23bc1fe2b563 + .quad 0x3fcaf3c94e80bff3 + .quad 0x3fcbc286742d8cd6 + .quad 0x3fcc8ff7c79a9a22 + .quad 0x3fcd5c216b4fbb91 + .quad 0x3fce27076e2af2e6 + .quad 0x3fcef0adcbdc5936 + .quad 0x3fcfb9186d5e3e2b + .quad 0x3fd0402594b4d041 + .quad 0x3fd0a324e27390e3 + .quad 0x3fd1058bf9ae4ad5 + .quad 0x3fd1675cababa60e + .quad 0x3fd1c898c16999fb + .quad 0x3fd22941fbcf7966 + .quad 0x3fd2895a13de86a3 + .quad 0x3fd2e8e2bae11d31 + .quad 0x3fd347dd9a987d55 + .quad 0x3fd3a64c556945ea + .quad 0x3fd404308686a7e4 + .quad 0x3fd4618bc21c5ec2 + .quad 0x3fd4be5f957778a1 + .quad 0x3fd51aad872df82d + .quad 0x3fd5767717455a6c + .quad 0x3fd5d1bdbf5809ca + .quad 0x3fd62c82f2b9c795 + .quad 0x3fd686c81e9b14af + .quad 0x3fd6e08eaa2ba1e4 + .quad 0x3fd739d7f6bbd007 + .quad 0x3fd792a55fdd47a2 + .quad 0x3fd7eaf83b82afc3 + .quad 0x3fd842d1da1e8b17 + .quad 0x3fd89a3386c1425b + .quad 0x3fd8f11e873662c8 + .quad 0x3fd947941c2116fb + .quad 0x3fd99d958117e08b + .quad 0x3fd9f323ecbf984c + .quad 0x3fda484090e5bb0a + .quad 0x3fda9cec9a9a084a + .quad 0x3fdaf1293247786b + .quad 0x3fdb44f77bcc8f63 + .quad 0x3fdb9858969310fb + .quad 0x3fdbeb4d9da71b7c + .quad 0x3fdc3dd7a7cdad4d + .quad 0x3fdc8ff7c79a9a22 + .quad 0x3fdce1af0b85f3eb + .quad 0x3fdd32fe7e00ebd5 + .quad 0x3fdd83e7258a2f3e + .quad 0x3fddd46a04c1c4a1 + .quad 0x3fde24881a7c6c26 + .quad 0x3fde744261d68788 + .quad 0x3fdec399d2468cc0 + .quad 0x3fdf128f5faf06ed + .quad 0x3fdf6123fa7028ac + 
.quad 0x3fdfaf588f78f31f + .quad 0x3fdffd2e0857f498 + .quad 0x3fe02552a5a5d0ff + .quad 0x3fe04bdf9da926d2 + .quad 0x3fe0723e5c1cdf40 + .quad 0x3fe0986f4f573521 + .quad 0x3fe0be72e4252a83 + .quad 0x3fe0e44985d1cc8c + .quad 0x3fe109f39e2d4c97 + .quad 0x3fe12f719593efbc + .quad 0x3fe154c3d2f4d5ea + .quad 0x3fe179eabbd899a1 + .quad 0x3fe19ee6b467c96f + .quad 0x3fe1c3b81f713c25 + .quad 0x3fe1e85f5e7040d0 + .quad 0x3fe20cdcd192ab6e + .quad 0x3fe23130d7bebf43 + .quad 0x3fe2555bce98f7cb + .quad 0x3fe2795e1289b11b + .quad 0x3fe29d37fec2b08b + .quad 0x3fe2c0e9ed448e8c + .quad 0x3fe2e47436e40268 + .quad 0x3fe307d7334f10be + .quad 0x3fe32b1339121d71 + .quad 0x3fe34e289d9ce1d3 + .quad 0x3fe37117b54747b6 + .quad 0x3fe393e0d3562a1a + .quad 0x3fe3b68449fffc23 + .quad 0x3fe3d9026a7156fb + .quad 0x3fe3fb5b84d16f42 + .quad 0x3fe41d8fe84672ae + .quad 0x3fe43f9fe2f9ce67 + .quad 0x3fe4618bc21c5ec2 + .quad 0x3fe48353d1ea88df + .quad 0x3fe4a4f85db03ebb + .quad 0x3fe4c679afccee3a + .quad 0x3fe4e7d811b75bb1 + .quad 0x3fe50913cc01686b + .quad 0x3fe52a2d265bc5ab + .quad 0x3fe54b2467999498 + .quad 0x3fe56bf9d5b3f399 + .quad 0x3fe58cadb5cd7989 + .quad 0x3fe5ad404c359f2d + .quad 0x3fe5cdb1dc6c1765 + .quad 0x3fe5ee02a9241675 + .quad 0x3fe60e32f44788d9 + .quad 0x3fe62e42fefa39ef + +.align 16 +.L__log_F_inv: + .quad 0x4000000000000000 + .quad 0x3fffc07f01fc07f0 + .quad 0x3fff81f81f81f820 + .quad 0x3fff44659e4a4271 + .quad 0x3fff07c1f07c1f08 + .quad 0x3ffecc07b301ecc0 + .quad 0x3ffe9131abf0b767 + .quad 0x3ffe573ac901e574 + .quad 0x3ffe1e1e1e1e1e1e + .quad 0x3ffde5d6e3f8868a + .quad 0x3ffdae6076b981db + .quad 0x3ffd77b654b82c34 + .quad 0x3ffd41d41d41d41d + .quad 0x3ffd0cb58f6ec074 + .quad 0x3ffcd85689039b0b + .quad 0x3ffca4b3055ee191 + .quad 0x3ffc71c71c71c71c + .quad 0x3ffc3f8f01c3f8f0 + .quad 0x3ffc0e070381c0e0 + .quad 0x3ffbdd2b899406f7 + .quad 0x3ffbacf914c1bad0 + .quad 0x3ffb7d6c3dda338b + .quad 0x3ffb4e81b4e81b4f + .quad 0x3ffb2036406c80d9 + .quad 0x3ffaf286bca1af28 + .quad 0x3ffac5701ac5701b + .quad 0x3ffa98ef606a63be + .quad 0x3ffa6d01a6d01a6d + .quad 0x3ffa41a41a41a41a + .quad 0x3ffa16d3f97a4b02 + .quad 0x3ff9ec8e951033d9 + .quad 0x3ff9c2d14ee4a102 + .quad 0x3ff999999999999a + .quad 0x3ff970e4f80cb872 + .quad 0x3ff948b0fcd6e9e0 + .quad 0x3ff920fb49d0e229 + .quad 0x3ff8f9c18f9c18fa + .quad 0x3ff8d3018d3018d3 + .quad 0x3ff8acb90f6bf3aa + .quad 0x3ff886e5f0abb04a + .quad 0x3ff8618618618618 + .quad 0x3ff83c977ab2bedd + .quad 0x3ff8181818181818 + .quad 0x3ff7f405fd017f40 + .quad 0x3ff7d05f417d05f4 + .quad 0x3ff7ad2208e0ecc3 + .quad 0x3ff78a4c8178a4c8 + .quad 0x3ff767dce434a9b1 + .quad 0x3ff745d1745d1746 + .quad 0x3ff724287f46debc + .quad 0x3ff702e05c0b8170 + .quad 0x3ff6e1f76b4337c7 + .quad 0x3ff6c16c16c16c17 + .quad 0x3ff6a13cd1537290 + .quad 0x3ff6816816816817 + .quad 0x3ff661ec6a5122f9 + .quad 0x3ff642c8590b2164 + .quad 0x3ff623fa77016240 + .quad 0x3ff6058160581606 + .quad 0x3ff5e75bb8d015e7 + .quad 0x3ff5c9882b931057 + .quad 0x3ff5ac056b015ac0 + .quad 0x3ff58ed2308158ed + .quad 0x3ff571ed3c506b3a + .quad 0x3ff5555555555555 + .quad 0x3ff5390948f40feb + .quad 0x3ff51d07eae2f815 + .quad 0x3ff5015015015015 + .quad 0x3ff4e5e0a72f0539 + .quad 0x3ff4cab88725af6e + .quad 0x3ff4afd6a052bf5b + .quad 0x3ff49539e3b2d067 + .quad 0x3ff47ae147ae147b + .quad 0x3ff460cbc7f5cf9a + .quad 0x3ff446f86562d9fb + .quad 0x3ff42d6625d51f87 + .quad 0x3ff4141414141414 + .quad 0x3ff3fb013fb013fb + .quad 0x3ff3e22cbce4a902 + .quad 0x3ff3c995a47babe7 + .quad 0x3ff3b13b13b13b14 + .quad 0x3ff3991c2c187f63 + .quad 0x3ff3813813813814 + .quad 
0x3ff3698df3de0748 + .quad 0x3ff3521cfb2b78c1 + .quad 0x3ff33ae45b57bcb2 + .quad 0x3ff323e34a2b10bf + .quad 0x3ff30d190130d190 + .quad 0x3ff2f684bda12f68 + .quad 0x3ff2e025c04b8097 + .quad 0x3ff2c9fb4d812ca0 + .quad 0x3ff2b404ad012b40 + .quad 0x3ff29e4129e4129e + .quad 0x3ff288b01288b013 + .quad 0x3ff27350b8812735 + .quad 0x3ff25e22708092f1 + .quad 0x3ff2492492492492 + .quad 0x3ff23456789abcdf + .quad 0x3ff21fb78121fb78 + .quad 0x3ff20b470c67c0d9 + .quad 0x3ff1f7047dc11f70 + .quad 0x3ff1e2ef3b3fb874 + .quad 0x3ff1cf06ada2811d + .quad 0x3ff1bb4a4046ed29 + .quad 0x3ff1a7b9611a7b96 + .quad 0x3ff19453808ca29c + .quad 0x3ff1811811811812 + .quad 0x3ff16e0689427379 + .quad 0x3ff15b1e5f75270d + .quad 0x3ff1485f0e0acd3b + .quad 0x3ff135c81135c811 + .quad 0x3ff12358e75d3033 + .quad 0x3ff1111111111111 + .quad 0x3ff0fef010fef011 + .quad 0x3ff0ecf56be69c90 + .quad 0x3ff0db20a88f4696 + .quad 0x3ff0c9714fbcda3b + .quad 0x3ff0b7e6ec259dc8 + .quad 0x3ff0a6810a6810a7 + .quad 0x3ff0953f39010954 + .quad 0x3ff0842108421084 + .quad 0x3ff073260a47f7c6 + .quad 0x3ff0624dd2f1a9fc + .quad 0x3ff05197f7d73404 + .quad 0x3ff0410410410410 + .quad 0x3ff03091b51f5e1a + .quad 0x3ff0204081020408 + .quad 0x3ff0101010101010 + .quad 0x3ff0000000000000 + +#--------------------- +# exp data +#--------------------- + +.align 16 + +.L__real_zero: .quad 0x0000000000000000 + .quad 0 + +.L__real_p4096: .quad 0x40b0000000000000 + .quad 0 +.L__real_m4768: .quad 0x0c0b2a00000000000 + .quad 0 + +.L__real_32_by_log2: .quad 0x40471547652b82fe # 32/ln(2) + .quad 0 +.L__real_log2_by_32: .quad 0x3f962e42fefa39ef # log2_by_32 + .quad 0 + +.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24 + .quad 0 +.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6 + .quad 0 +.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2 + .quad 0 +.L__real_1_by_1: .quad 0x3ff0000000000000 # 1 + .quad 0 + +.align 16 + +.L__two_to_jby32_table: + .quad 0x3ff0000000000000 + .quad 0x3ff059b0d3158574 + .quad 0x3ff0b5586cf9890f + .quad 0x3ff11301d0125b51 + .quad 0x3ff172b83c7d517b + .quad 0x3ff1d4873168b9aa + .quad 0x3ff2387a6e756238 + .quad 0x3ff29e9df51fdee1 + .quad 0x3ff306fe0a31b715 + .quad 0x3ff371a7373aa9cb + .quad 0x3ff3dea64c123422 + .quad 0x3ff44e086061892d + .quad 0x3ff4bfdad5362a27 + .quad 0x3ff5342b569d4f82 + .quad 0x3ff5ab07dd485429 + .quad 0x3ff6247eb03a5585 + .quad 0x3ff6a09e667f3bcd + .quad 0x3ff71f75e8ec5f74 + .quad 0x3ff7a11473eb0187 + .quad 0x3ff82589994cce13 + .quad 0x3ff8ace5422aa0db + .quad 0x3ff93737b0cdc5e5 + .quad 0x3ff9c49182a3f090 + .quad 0x3ffa5503b23e255d + .quad 0x3ffae89f995ad3ad + .quad 0x3ffb7f76f2fb5e47 + .quad 0x3ffc199bdd85529c + .quad 0x3ffcb720dcef9069 + .quad 0x3ffd5818dcfba487 + .quad 0x3ffdfc97337b9b5f + .quad 0x3ffea4afa2a490da + .quad 0x3fff50765b6e4540 + +
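
The exp constants and the 2^(j/32) table above follow the usual table-driven scheme: n = rint(x * 32/ln 2) selects a table entry and a power of two, and a short Taylor polynomial (the 1/24, 1/6, 1/2 constants) handles the small remainder. A minimal C sketch of that flow, under stated assumptions: the names are illustrative, exp2(j/32.0) stands in for the .L__two_to_jby32_table lookup, and ln(2)/32 is used in one piece where the assembly keeps a head and a tail for extra accuracy.

    #include <math.h>

    /* Sketch only, not the shipped source. */
    static double exp_sketch(double x)
    {
        int n = (int)floor(x * 4.616624130844683e+01 + 0.5); /* 32/ln(2) */
        int j = n & 31;                    /* table index for 2^(j/32) */
        int m = (n - j) / 32;              /* remaining power of two */
        double r = x - n * 2.166084939249829e-02;  /* ln(2)/32, single piece */
        /* degree-4 Taylor series for e^r - 1, matching 1/24, 1/6, 1/2 */
        double q = r + r * r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0)));
        double f = exp2(j / 32.0);         /* stands in for the table entry */
        return ldexp(f * (1.0 + q), m);
    }

Because |r| never exceeds ln(2)/64, the short polynomial already reaches double precision once the table supplies the coarse 2^(j/32) factor.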
diff --git a/src/gas/remainder.S b/src/gas/remainder.S new file mode 100644 index 0000000..173da80 --- /dev/null +++ b/src/gas/remainder.S
@@ -0,0 +1,256 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# remainder.S
+#
+# An implementation of the remainder libm function.
+#
+# Prototype:
+#
+#     double remainder(double x,double y);
+#
+
+#
+# Algorithm: compute q = trunc(|x| / |y|) and dx = |x| - q*|y|, with
+# the product q*|y| evaluated in quad precision; subtract one more |y|
+# when dx exceeds half of |y| (ties follow the parity of q) so the
+# result matches the IEEE remainder. Fall back to the x87 fprem1
+# instruction when either exponent is 0 or the exponents differ by
+# 52 or more.
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(remainder)
+#define fname_special _remainder_special
+
+
+# local variable storage offsets
+.equ temp_x, 0x0
+.equ temp_y, 0x10
+.equ stack_size, 0x80
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+    movd %xmm0,%r8
+    movd %xmm1,%r9
+    movsd %xmm0,%xmm2
+    movsd %xmm1,%xmm3
+    movsd %xmm0,%xmm4
+    movsd %xmm1,%xmm5
+    mov .L__exp_mask_64(%rip), %r10
+    and %r10,%r8
+    and %r10,%r9
+    xor %r10,%r10
+    ror $52, %r8
+    ror $52, %r9
+    cmp $0,%r8
+    jz .L__LargeExpDiffComputation
+    cmp $0,%r9
+    jz .L__LargeExpDiffComputation
+    sub %r9,%r8            # r8 = exponent difference
+    cmp $52,%r8
+    jge .L__LargeExpDiffComputation
+    pand .L__Nan_64(%rip),%xmm4
+    pand .L__Nan_64(%rip),%xmm5
+    comisd %xmm5,%xmm4
+    jp .L__InputIsNaN      # if either of xmm1 or xmm0 is a NaN then
+                           # the parity flag is set
+    jz .L__Input_Is_Equal
+    jbe .L__ReturnImmediate
+    cmp $0x7FF,%r8
+    jz .L__Dividend_Is_Infinity
+
+    # calculation without using the x87 FPU
+.L__DirectComputation:
+    movapd %xmm4,%xmm2
+    movapd %xmm5,%xmm3
+    divsd %xmm3,%xmm2
+    cvttsd2siq %xmm2,%r8
+    mov %r8,%r10
+    and $0X01,%r10
+    cvtsi2sdq %r8,%xmm2
+
+    # multiplication in quad precision:
+    # a plain double-precision multiply here lost bits and produced a
+    # wrong result, so we implement a quad-precision multiplication.
+    # Logic behind the quad-precision multiplication:
+    # x = hx + tx, formed by setting x's last 27 mantissa bits to zero
+    # y = hy + ty, split the same way
+    movapd .L__27bit_andingmask_64(%rip),%xmm4
+    movapd %xmm5,%xmm1     # x
+    movapd %xmm2,%xmm6     # y
+    movapd %xmm2,%xmm7     # z = xmm7
+    mulpd %xmm5,%xmm7      # z = x*y
+    andpd %xmm4,%xmm1
+    andpd %xmm4,%xmm2
+    subsd %xmm1,%xmm5      # xmm1 = hx, xmm5 = tx
+    subsd %xmm2,%xmm6      # xmm2 = hy, xmm6 = ty
+
+    movapd %xmm1,%xmm4     # copy hx
+    mulsd %xmm2,%xmm4      # xmm4 = hx*hy
+    subsd %xmm7,%xmm4      # xmm4 = (hx*hy - *z)
+    mulsd %xmm6,%xmm1      # xmm1 = hx * ty
+    addsd %xmm1,%xmm4      # xmm4 = ((hx * hy - *z) + hx * ty)
+    mulsd %xmm5,%xmm2      # xmm2 = tx * hy
+    addsd %xmm2,%xmm4      # xmm4 = (((hx * hy - *z) + hx * ty) + tx * hy)
+    mulsd %xmm5,%xmm6      # xmm6 = tx * ty
+    addsd %xmm4,%xmm6      # xmm6 = (((hx * hy - *z) + hx * ty) + tx * hy) + tx * ty;
+    # xmm6 and xmm7 contain the quad precision result
+
+    # v = dx - c;
+    movapd %xmm0,%xmm1     # copy the input number
+    pand .L__Nan_64(%rip),%xmm1
+    movapd %xmm1,%xmm2     # xmm2 = dx = xmm1
+    subsd %xmm7,%xmm1      # v = dx - c
+    subsd %xmm1,%xmm2      # (dx - v)
+    subsd %xmm7,%xmm2      # ((dx - v) - c)
+    subsd %xmm6,%xmm2      # (((dx - v) - c) - cc)
+    addsd %xmm1,%xmm2      # xmm2 = dx = v + (((dx - v) - c) - cc)
# xmm3 = w
+    movapd %xmm2,%xmm4
+    movapd %xmm3,%xmm5
+    addsd %xmm4,%xmm4      # xmm4 = dx + dx
+    comisd %xmm4,%xmm3     # if (dx + dx > w)
+    jb .L__Substractw
+    mulpd .L__ZeroPointFive(%rip),%xmm5    # xmm5 = 0.5 * w
+    comisd %xmm2,%xmm5     # if (dx > 0.5 * w)
+    jb .L__Substractw
+    cmp $0x01,%r10         # if the quotient is an odd number
+    jnz .L__Finish
+    comisd %xmm4,%xmm3     # if (todd && (dx + dx == w)) then subtract w
+    jz .L__Substractw
+    comisd %xmm0,%xmm5     # if (todd && (dx == 0.5 * w)) then subtract w
+    jnz .L__Finish
+
+.L__Substractw:
+    subsd %xmm3,%xmm2      # dx -= w
+
+# The following code checks the sign of the input number and then calculates the return value:
+# return x < 0.0 ? -dx : dx;
+.L__Finish:
+    comisd .L__Zero_64(%rip), %xmm0
+    ja .L__Not_Negative_Number1
+
+.L__Negative_Number1:
+    movapd .L__Zero_64(%rip),%xmm0
+    subsd %xmm2,%xmm0
+    ret
+.L__Not_Negative_Number1:
+    movapd %xmm2,%xmm0
+    ret
+
+
+    # calculation using the x87 FPU,
+    # for inputs where the exponent of either the divisor or the
+    # dividend is 0, or where the exponent difference is greater
+    # than 52
+.align 16
+.L__LargeExpDiffComputation:
+    sub $stack_size, %rsp
+    movsd %xmm0, temp_x(%rsp)
+    movsd %xmm1, temp_y(%rsp)
+    ffree %st(0)
+    ffree %st(1)
+    fldl temp_y(%rsp)
+    fldl temp_x(%rsp)
+    fnclex
+.align 16
+.L__repeat:
+    fprem1                 # compute the partial remainder of st(0) / st(1);
+                           # fprem1 sets the x87 condition codes, with C2
+                           # set to 1 while a partial remainder remains
+    fnstsw %ax             # store the floating-point status word in %ax
+    and $0x0400,%ax        # we only need the C2 bit of the condition codes
+    cmp $0x0400,%ax        # check whether bit 10 (C2) is set;
+                           # if it is set, only a partial remainder was calculated
+    jz .L__repeat
+    # store the result from the FPU stack to memory
+    fstpl temp_x(%rsp)
+    fstpl temp_y(%rsp)
+    movsd temp_x(%rsp), %xmm0
+    add $stack_size, %rsp
+    ret
+
+    # if both the inputs are equal
+.L__Input_Is_Equal:
+    cmp $0x7FF,%r8
+    jz .L__Dividend_Is_Infinity
+    cmp $0x7FF,%r9
+    jz .L__InputIsNaN
+    movsd %xmm0,%xmm1
+    pand .L__sign_mask_64(%rip),%xmm1
+    movsd .L__Zero_64(%rip),%xmm0
+    por %xmm1,%xmm0
+    ret
+
+.L__InputIsNaN:
+    por .L__QNaN_mask_64(%rip),%xmm0
+    por .L__exp_mask_64(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+    ret
+
+# case when |x| < |y|
+.L__ReturnImmediate:
+    movapd %xmm5,%xmm7
+    mulpd .L__ZeroPointFive(%rip),%xmm5    # xmm5 = 0.5 * |y|
+    comisd %xmm4,%xmm5
+    jae .L__FoundResult1
+    subsd %xmm7,%xmm4
+    comisd .L__Zero_64(%rip),%xmm0
+    ja .L__Not_Negative_Number
+.L__Negative_Number:
+    movapd .L__Zero_64(%rip),%xmm0
+    subsd %xmm4,%xmm0
+    ret
+
+.L__Not_Negative_Number:
+    movapd %xmm4,%xmm0
+    ret
+.align 16
+.L__FoundResult1:
+    ret
+
+
+
+.align 32
+.L__sign_mask_64: .quad 0x8000000000000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__QNaN_mask_64: .quad 0x0008000000000000
+ .quad 0
+.L__Nan_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__ZeroPointFive: .quad 0X3FE0000000000000
+ .quad 0
+
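
The quad-precision product in .L__DirectComputation is Dekker-style two-product arithmetic; the "*z" in the comments is the pointer dereference carried over from the original C helper. Here is a C model of it, with an illustrative name (mul_quad) and union-based masking standing in for the pand with .L__27bit_andingmask_64:

    #include <stdint.h>

    /* On return, *z is the rounded product and *zz its rounding error,
       so *z + *zz == x*y exactly. Sketch only, not the shipped source. */
    static void mul_quad(double x, double y, double *z, double *zz)
    {
        union { double d; uint64_t u; } s;
        double hx, tx, hy, ty;
        s.d = x; s.u &= 0xfffffffff8000000ULL; hx = s.d; tx = x - hx;  /* 27-bit split */
        s.d = y; s.u &= 0xfffffffff8000000ULL; hy = s.d; ty = y - hy;
        *z  = x * y;
        *zz = (((hx * hy - *z) + hx * ty) + tx * hy) + tx * ty;
    }

The split leaves each half with at most 26 significant bits, so every partial product (hx*hy, hx*ty, tx*hy, tx*ty) fits in a double exactly; that is what makes the error term *zz exact.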
diff --git a/src/gas/remainderf.S b/src/gas/remainderf.S new file mode 100644 index 0000000..d196d11 --- /dev/null +++ b/src/gas/remainderf.S
@@ -0,0 +1,221 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# remainderf.S
+#
+# An implementation of the remainderf libm function.
+#
+# Prototype:
+#
+#     float remainderf(float x,float y);
+#
+
+#
+# Algorithm: convert x and y to double; scale |y| up by 2^(24*ntimes)
+# to cover the exponent gap, then strip up to 24 quotient bits per
+# iteration (divide, truncate, multiply back and subtract), and
+# finally reduce the remainder to at most half of |y| as the IEEE
+# remainder requires.
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(remainderf)
+#define fname_special _remainderf_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+    mov .L__exp_mask_64(%rip), %rdi
+    movapd .L__sign_mask_64(%rip),%xmm6
+    cvtss2sd %xmm0,%xmm2   # double x
+    cvtss2sd %xmm1,%xmm3   # double y
+    pand %xmm6,%xmm2
+    pand %xmm6,%xmm3
+    movd %xmm2,%rax
+    movd %xmm3,%r8
+    mov %rax,%r11
+    mov %r8,%r9
+    movsd %xmm2,%xmm4
+    # take the exponents of both x and y
+    and %rdi,%rax
+    and %rdi,%r8
+    ror $52, %rax
+    ror $52, %r8
+    # if either of the exponents is 0x7FF (NaN or infinity)
+    cmp $0X7FF,%rax
+    jz .L__InputIsNaN
+    cmp $0X7FF,%r8
+    jz .L__InputIsNaNOrInf
+
+    cmp $0,%r8
+    jz .L__Divisor_Is_Zero
+
+    cmp %r9, %r11
+    jz .L__Input_Is_Equal
+    jb .L__ReturnImmediate
+
+    xor %rcx,%rcx
+    mov $24,%rdx
+    movsd .L__One_64(%rip),%xmm7   # xmm7 = scale
+    cmp %rax,%r8
+    jae .L__y_is_greater
+    # xmm3 = dy
+    sub %r8,%rax
+    div %dl                # al = ntimes
+    mov %al,%cl            # cl = ntimes
+    and $0xFF,%ax          # set everything to zero except al
+    mul %dl                # ax = dl * al = 24 * ntimes
+    add $1023, %rax
+    shl $52,%rax
+    movd %rax,%xmm7        # xmm7 = scale
+.L__y_is_greater:
+    mulsd %xmm3,%xmm7      # xmm7 = scale * dy
+    movsd .L__2pminus24_decimal(%rip),%xmm6
+
+.align 16
+.L__Start_Loop:
+    dec %cl
+    js .L__End_Loop
+    divsd %xmm7,%xmm4      # xmm4 = dx / w
+    cvttsd2siq %xmm4,%rax
+    cvtsi2sdq %rax,%xmm4   # xmm4 = t = (double)((int)(dx / w))
+    mulsd %xmm7,%xmm4      # xmm4 = w*t
+    mulsd %xmm6,%xmm7      # w *= scale
+    subsd %xmm4,%xmm2      # xmm2 = dx -= w*t
+    movsd %xmm2,%xmm4      # xmm4 = dx
+    jmp .L__Start_Loop
+.L__End_Loop:
+    divsd %xmm7,%xmm4      # xmm4 = dx / w
+    cvttsd2siq %xmm4,%rax
+    cvtsi2sdq %rax,%xmm4   # xmm4 = t = (double)((int)(dx / w))
+    and $0x01,%rax         # todd = ((int)(dx / w)) & 1
+    mulsd %xmm7,%xmm4      # xmm4 = w*t
+    subsd %xmm4,%xmm2      # xmm2 = dx -= w*t
+    movsd %xmm7,%xmm6      # store w
+    mulsd .L__Zero_Point_Five64(%rip),%xmm7    # xmm7 = 0.5*w
+
+    cmp $0x01,%rax
+    jnz .L__todd_is_even
+    comisd %xmm2,%xmm7
+    je .L__Subtract_w
+
+.L__todd_is_even:
+    comisd %xmm2,%xmm7
+    jnb .L__Dont_Subtract_w
+
+.L__Subtract_w:
+    subsd %xmm6,%xmm2
+
+.L__Dont_Subtract_w:
+    comiss .L__Zero_64(%rip),%xmm0
+    jb .L__Negative
+    cvtsd2ss %xmm2,%xmm0
+    ret
+.L__Negative:
+    movsd .L__MinusZero_64(%rip),%xmm0
+    subsd %xmm2,%xmm0
+    cvtsd2ss %xmm0,%xmm0
+    ret
+
+.align 16
+.L__Input_Is_Equal:
+    cmp $0x7FF,%rax
+    jz .L__Dividend_Is_Infinity
+    cmp $0x7FF,%r8
+    jz .L__InputIsNaNOrInf
+    movsd %xmm0,%xmm1
+    pand .L__sign_bit_32(%rip),%xmm1
+    movss .L__Zero_64(%rip),%xmm0
+    por
%xmm1,%xmm0 + ret + +.L__InputIsNaNOrInf: + comiss %xmm0,%xmm1 + jp .L__InputIsNaN + ret +.L__Divisor_Is_Zero: +.L__InputIsNaN: + por .L__exp_mask_32(%rip),%xmm0 +.L__Dividend_Is_Infinity: + por .L__QNaN_mask_32(%rip),%xmm0 + ret + +#Case when x < y + #xmm2 = dx +.L__ReturnImmediate: + movsd %xmm3,%xmm5 + mulsd .L__Zero_Point_Five64(%rip), %xmm3 # xmm3 = 0.5*dy + comisd %xmm3,%xmm2 # if (dx > 0.5*dy) + jna .L__Finish_Immediate # xmm2 <= xmm3 + subsd %xmm5,%xmm2 #dx -= dy + +.L__Finish_Immediate: + comiss .L__Zero_64(%rip),%xmm0 + #xmm0 contains the input and is the result + jz .L__Zero + ja .L__Positive + + movsd .L__Zero_64(%rip),%xmm0 + subsd %xmm2,%xmm0 + cvtsd2ss %xmm0,%xmm0 + ret + +.L__Zero: + ret + +.L__Positive: + cvtsd2ss %xmm2,%xmm0 + ret + + + +.align 32 +.L__sign_bit_32: .quad 0x8000000080000000 + .quad 0x0 +.L__exp_mask_64: .quad 0x7FF0000000000000 + .quad 0x0 +.L__exp_mask_32: .quad 0x000000007F800000 + .quad 0x0 +.L__27bit_andingmask_64: .quad 0xfffffffff8000000 + .quad 0 +.L__2p52_mask_64: .quad 0x4330000000000000 + .quad 0 +.L__One_64: .quad 0x3FF0000000000000 + .quad 0 +.L__Zero_64: .quad 0x0 + .quad 0 +.L__MinusZero_64: .quad 0x8000000000000000 + .quad 0 +.L__QNaN_mask_32: .quad 0x0000000000400000 + .quad 0 +.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF + .quad 0 +.L__2pminus24_decimal: .quad 0x3E70000000000000 + .quad 0 +.L__Zero_Point_Five64: .quad 0x3FE0000000000000 + .quad 0 +
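
A rough C model of the reduction loop above (.L__Start_Loop plus the tie handling that follows). Names are illustrative, sign handling and special cases are omitted, and ntimes is the count the assembly derives from the exponent difference divided by 24:

    #include <math.h>

    /* Sketch only: w = |y| scaled up by 2^(24*ntimes); each pass peels
       off up to 24 quotient bits, then w shrinks by 2^-24. */
    static double remainderf_loop(double dx, double dy, int ntimes)
    {
        double w = dy * exp2(24.0 * ntimes);     /* 'scale * dy' */
        for (int i = 0; i < ntimes; i++) {
            double t = (double)(long long)(dx / w);  /* truncate, like cvttsd2siq */
            dx -= t * w;
            w *= 0x1p-24;                        /* .L__2pminus24_decimal */
        }
        long long q = (long long)(dx / w);       /* final pass keeps the parity */
        dx -= (double)q * w;
        if (dx > 0.5 * w || ((q & 1) && dx == 0.5 * w))
            dx -= w;                             /* round the quotient to even */
        return dx;
    }

The intermediate quotients never exceed 2^24, so each truncated division is exact in double precision; only the final parity decides how a half-way remainder is settled.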
diff --git a/src/gas/round.S b/src/gas/round.S new file mode 100644 index 0000000..c1ac20a --- /dev/null +++ b/src/gas/round.S
@@ -0,0 +1,151 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# round.S
+#
+# An implementation of the round libm function.
+#
+# Prototype:
+#
+#     double round(double x);
+#
+
+#
+# Algorithm: First get the exponent of the input
+# double precision number.
+# If the exponent is greater than 51 then return the
+# input as is.
+# If the exponent is less than 0 then flush out the fraction
+# bits by adding 2^52 + 1 and then subtracting the
+# same number.
+# If the exponent is 0 or more then add 0.5 and
+# shift the mantissa bits based on the exponent
+# value to discard the fractional component.
+#
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(round)
+#define fname_special _round_special
+
+
+# local variable storage offsets
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+# note: SSE4.1 added roundss/roundsd instructions that do this in hardware
+fname:
+    movsd .L__2p52_plus_one(%rip),%xmm4
+    movsd .L__sign_mask_64(%rip),%xmm5
+    mov $52,%r10
+    # take 3 copies of the input xmm0
+    movsd %xmm0,%xmm1
+    movsd %xmm0,%xmm2
+    movsd %xmm0,%xmm3
+    # get the most significant half-word of the input number into r9
+    pand .L__exp_mask_64(%rip), %xmm1
+    pextrw $3,%xmm1,%r9
+    cmp $0X7FF0,%r9
+    # check for infinity inputs
+    jz .L__is_infinity
+    movsd .L__sign_mask_64(%rip), %xmm1
+    pandn %xmm2,%xmm1      # xmm1 now stores the sign of the input number
+    # after shifting r9 right and subtracting 0x3FF,
+    # r9 stores the unbiased exponent.
+    shr $0X4,%r9
+    sub $0x3FF,%r9
+    cmp $0x00, %r9
+    jl .L__number_less_than_zero
+
+    # if the exponent is 0 or more
+.L__number_greater_than_zero:
+    cmp $51,%r9
+    jg .L__is_greater_than_2p52
+
+    # if the exponent is between 0 and 51, i.e. |x| < 2^52
+    pand .L__sign_mask_64(%rip),%xmm0
+    # add 0.5
+    addsd .L__zero_point_5(%rip),%xmm0
+    movsd %xmm0,%xmm5
+
+    pand .L__exp_mask_64(%rip),%xmm5
+    pand .L__mantissa_mask_64(%rip),%xmm0
+    # r10 = 52 (mantissa length) - r9 (input exponent)
+    sub %r9,%r10
+    movd %r10, %xmm2
+    # do a right and then a left shift by (52 - exponent) to clear the fraction bits
+    psrlq %xmm2,%xmm0
+    psllq %xmm2,%xmm0
+    # OR the result exponent with the input sign
+    por %xmm1,%xmm5
+    # finally OR with the mantissa
+    por %xmm5,%xmm0
+    ret
+
+    # if the exponent is less than 0
+.L__number_less_than_zero:
+    pand %xmm5,%xmm3       # xmm3 = abs(input)
+    addsd %xmm4,%xmm3      # add (2^52 + 1)
+    subsd %xmm4,%xmm3      # sub (2^52 + 1)
+    por %xmm1, %xmm3       # OR with the sign of the input number
+    movsd %xmm3,%xmm0
+    ret
+
+    # if the input is infinity
+.L__is_infinity:
+    comisd %xmm4,%xmm0
+    jnp .L__is_zero        # parity flag clear means the compare was ordered
+                           # (infinity), so return the input unchanged
+    # if the input is a NaN
+.L__is_nan:
+    por .L__qnan_mask_64(%rip),%xmm0   # set the QNaN bit
+.L__is_zero:
+.L__is_greater_than_2p52:
+    ret
+
+.align 16
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+
+.L__qnan_mask_64: .quad 0x0008000000000000
+ .quad 0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0
+.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF
+ .quad 0
+.L__zero: .quad 0x0000000000000000
+ .quad 0
+.L__2p52_plus_one: .quad 0x4330000000000001 # = 4503599627370497.0
+ .quad 0
+.L__zero_point_5: .quad 0x3FE0000000000001 # = 0.5 + 1 ulp
+ .quad 0
+
+
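
For reference, a C rendering of the two main paths above; this is a sketch with NaN/infinity handling omitted (the assembly also sets the QNaN bit for NaN inputs), not the shipped source:

    #include <math.h>
    #include <stdint.h>

    static double round_sketch(double x)
    {
        union { double d; uint64_t u; } v = { x };
        uint64_t sign = v.u & 0x8000000000000000ULL;
        int exp = (int)((v.u >> 52) & 0x7FF) - 1023;
        if (exp > 51)                        /* already integral */
            return x;
        if (exp < 0) {                       /* |x| < 1: flush the fraction out */
            v.d = fabs(x) + 4503599627370497.0;  /* add 2^52 + 1 ...          */
            v.d -= 4503599627370497.0;           /* ... and take it back      */
            v.u |= sign;
            return v.d;
        }
        v.d = fabs(x) + 0.5;                 /* bias, then clear fraction bits */
        uint64_t e = v.u & 0x7FF0000000000000ULL;
        uint64_t m = v.u & 0x000FFFFFFFFFFFFFULL;
        m = (m >> (52 - exp)) << (52 - exp); /* the psrlq/psllq pair */
        v.u = sign | e | m;
        return v.d;
    }

Note how both branches avoid any data-dependent rounding-mode tricks: the 2^52 + 1 addition exploits the fixed spacing of doubles near 2^52, and the shift pair simply truncates after the 0.5 bias.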
diff --git a/src/gas/sin.S b/src/gas/sin.S new file mode 100644 index 0000000..378e103 --- /dev/null +++ b/src/gas/sin.S
@@ -0,0 +1,481 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# +# An implementation of the sin function. +# +# Prototype: +# +# double sin(double x); +# +# Computes sin(x). +# It will provide proper C99 return values, +# but may not raise floating point status bits properly. +# Based on the NAG C implementation. +# +# +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 32 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0 # for alignment +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0 +.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 0x0411E848000000000 # 5e5 + .quad 0 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff # Sign bit zero + .quad 0 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0 + +.align 32 +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0 + .quad 0x0bf56c16c16c16967 # -0.00138889 c2 + .quad 0 + .quad 0x03EFA01A019F4EC91 # 2.48016e-005 c3 + .quad 0 + .quad 0x0bE927E4FA17F667B # -2.75573e-007 c4 + .quad 0 + .quad 0x03E21EEB690382EEC # 2.08761e-009 c5 + .quad 0 + .quad 0x0bDA907DB47258AA7 # -1.13826e-011 c6 + .quad 0 + +.align 32 +.Lsinarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0 + +.text +.align 32 +.p2align 4,,15 + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(sin) +#define fname_special _sin_special@PLT + +# define local variable storage offsets +.equ p_temp, 0x30 # temporary for get/put bits operation +.equ p_temp1, 0x40 # temporary for get/put bits operation +.equ r, 0x50 # pointer to r for amd_remainder_piby2 +.equ rr, 0x60 # pointer to rr for amd_remainder_piby2 +.equ region, 0x70 # pointer to region for amd_remainder_piby2 +.equ stack_size, 0x98 + +.globl fname +.type fname,@function + +fname: + sub $stack_size, %rsp + xorpd %xmm2, %xmm2 # zeroed out for later use + +# GET_BITS_DP64(x, ux); +# get the input value to an integer register. 
+    movsd %xmm0, p_temp(%rsp)
+    mov p_temp(%rsp), %rdx    # rdx is ux
+
+## if NaN or inf
+    mov $0x07ff0000000000000, %rax
+    mov %rax, %r10
+    and %rdx, %r10
+    cmp %rax, %r10
+    jz .Lsin_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+    mov $0x07fffffffffffffff, %r10
+    and %rdx, %r10            # r10 is ax
+    mov $1, %r8d              # for determining region later on
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+    mov $0x03fe921fb54442d18, %rax
+    cmp %rax, %r10
+    jg .Lsin_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+    mov $0x03f20000000000000, %rax
+    cmp %rax, %r10
+    jge .Lsin_small
+
+## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+    mov $0x03e40000000000000, %rax
+    cmp %rax, %r10
+    jge .Lsin_smaller
+
+# sin = x; (x is already in xmm0)
+    jmp .Lsin_cleanup
+
+.align 32
+.Lsin_smaller:
+# sin = x - x^3 * 0.1666666666666666666;
+    movsd %xmm0, %xmm2
+    movsd .L__real_3fc5555555555555(%rip), %xmm4   # 0.1666666666666666666
+    mulsd %xmm2, %xmm2        # x^2
+    mulsd %xmm0, %xmm2        # x^3
+    mulsd %xmm4, %xmm2        # x^3 * 0.1666666666666666666
+    subsd %xmm2, %xmm0        # x - x^3 * 0.1666666666666666666
+    jmp .Lsin_cleanup
+
+.align 32
+.Lsin_small:
+# sin = sin_piby4(x, 0.0);
+    movsd .L__real_3fe0000000000000(%rip), %xmm5   # .5
+
+.Lsin_piby4_noreduce:
+    movsd %xmm0, %xmm2
+    mulsd %xmm0, %xmm2        # x2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a sin calculation
+#  zs = (s2 + x2 * (s3 + x2 * (s4 + x2 * (s5 + x2 * s6))));
+    movsd .Lsinarray+0x50(%rip), %xmm3    # s6
+    mulsd %xmm2, %xmm3        # x2s6
+    movsd .Lsinarray+0x20(%rip), %xmm5    # s3
+    movsd %xmm2, %xmm1        # move for x4
+    mulsd %xmm2, %xmm1        # x4
+    mulsd %xmm2, %xmm5        # x2s3
+    movsd %xmm0, %xmm4        # move for x3
+    addsd .Lsinarray+0x40(%rip), %xmm3    # s5+x2s6
+    mulsd %xmm2, %xmm1        # x6
+    mulsd %xmm2, %xmm3        # x2(s5+x2s6)
+    mulsd %xmm2, %xmm4        # x3
+    addsd .Lsinarray+0x10(%rip), %xmm5    # s2+x2s3
+    mulsd %xmm2, %xmm5        # x2(s2+x2s3)
+    addsd .Lsinarray+0x30(%rip), %xmm3    # s4 + x2(s5+x2s6)
+    mulsd %xmm1, %xmm3        # x6(s4 + x2(s5+x2s6))
+    addsd .Lsinarray(%rip), %xmm5         # s1+x2(s2+x2s3)
+    addsd %xmm5, %xmm3        # zs
+    mulsd %xmm3, %xmm4        # *x3
+    addsd %xmm4, %xmm0        # +x
+    jmp .Lsin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsin_reduce:
+# xneg = (ax != ux);
+    cmp %r10, %rdx
+    mov $0, %r11d
+
+## if (xneg) x = -x;
+    jz .Lpositive
+    mov $1, %r11d
+    subsd %xmm0, %xmm2
+    movsd %xmm2, %xmm0
+
+.align 16
+.Lpositive:
+## if (x < 5.0e6)
+    cmp .L__real_411E848000000000(%rip), %r10
+    jae .Lsin_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+    movsd %xmm0, %xmm2
+    movsd .L__real_3fe45f306dc9c883(%rip), %xmm3   # twobypi
+    movsd %xmm0, %xmm4
+    movsd .L__real_3fe0000000000000(%rip), %xmm5   # .5
+    mulsd %xmm3, %xmm2
+
+#/* How many pi/2 is x a multiple of? */
+#  xexp = ax >> EXPSHIFTBITS_DP64;
+    mov %r10, %r9
+    shr $52, %r9              # >> EXPSHIFTBITS_DP64
+
+#  npi2 = (int)(x * twobypi + 0.5);
+    addsd %xmm5, %xmm2        # npi2
+
+    movsd .L__real_3ff921fb54400000(%rip), %xmm3   # piby2_1
+    cvttpd2dq %xmm2, %xmm0    # convert to integer
+    movsd .L__real_3dd0b4611a626331(%rip), %xmm1   # piby2_1tail
+    cvtdq2pd %xmm0, %xmm2     # and back to float.
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ +# rhead = x - npi2 * piby2_1; + mulsd %xmm2, %xmm3 + subsd %xmm3, %xmm4 # rhead + +# rtail = npi2 * piby2_1tail; + mulsd %xmm2, %xmm1 + movd %xmm0, %eax + +# GET_BITS_DP64(rhead-rtail, uy); + movsd %xmm4, %xmm0 + subsd %xmm1, %xmm0 + + movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2 + movsd %xmm0,p_temp(%rsp) + movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail + mov p_temp(%rsp), %rcx # rcx is rhead-rtail + +# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc +# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64); + shl $1, %rcx # strip any sign bit + shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1 + sub %rcx, %r9 # expdiff + +## if (expdiff > 15) + cmp $15, %r9 + jle .Lexplediff15 + +# /* The remainder is pretty small compared with x, which +# implies that x is a near multiple of pi/2 +# (x matches the multiple to at least 15 bits) */ + +# t = rhead; + movsd %xmm4, %xmm1 + +# rtail = npi2 * piby2_2; + mulsd %xmm2, %xmm3 + +# rhead = t - rtail; + mulsd %xmm2, %xmm5 # npi2 * piby2_2tail + subsd %xmm3, %xmm4 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + subsd %xmm4, %xmm1 # t - rhead + subsd %xmm3, %xmm1 # -rtail + subsd %xmm1, %xmm5 # rtail + +# r = rhead - rtail; + movsd %xmm4, %xmm0 + +#HARSHA +#xmm1=rtail + movsd %xmm5, %xmm1 + subsd %xmm5, %xmm0 + +# xmm0=r, xmm4=rhead, xmm1=rtail +.Lexplediff15: +# region = npi2 & 3; + + subsd %xmm0, %xmm4 # rhead-r + subsd %xmm1, %xmm4 # rr = (rhead-r) - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +## if the input was close to a pi/2 multiple +# The original NAG code missed this trick. If the input is very close to n*pi/2 after +# reduction, +# then the sin is ~ 1.0 , to within 53 bits, when r is < 2^-27. We already +# have x at this point, so we can skip the sin polynomials. + + cmp $0x03f2, %rcx # if r small. + jge .Lsin_piby4 # use taylor series if not + cmp $0x03de, %rcx # if r really small. + jle .Lr_small # then sin(r) = 0 + + movsd %xmm0, %xmm2 + mulsd %xmm2, %xmm2 # x^2 + +## if region is 0 or 2 do a sin calc. + and %eax, %r8d + jnz .Lcossmall + +# region 0 or 2 do a sin calculation +# use simply polynomial +# x - x*x*x*0.166666666666666666; + movsd .L__real_3fc5555555555555(%rip), %xmm3 + mulsd %xmm0, %xmm3 # * x + mulsd %xmm2, %xmm3 # * x^2 + subsd %xmm3, %xmm0 # xs + jmp .Ladjust_region + +.align 16 +.Lcossmall: +# region 1 or 3 do a cos calculation +# use simply polynomial +# 1.0 - x*x*0.5; + movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0 + mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2 + subsd %xmm2, %xmm0 # xc + jmp .Ladjust_region + +.align 16 +.Lr_small: +## if region is 1 or 3 do a cos calc. 
+ and %eax, %r8d + jz .Ladjust_region + +# odd + movsd .L__real_3ff0000000000000(%rip), %xmm0 # cos(r) is a 1 + jmp .Ladjust_region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 32 +.Lsin_reduce_precise: +# // Reduce x into range [-pi/4,pi/4] +# __amd_remainder_piby2(x, &r, &rr, ®ion); + + mov %r11,p_temp(%rsp) + lea region(%rsp), %rdx + lea rr(%rsp), %rsi + lea r(%rsp), %rdi + + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp), %r11 + mov $1, %r8d # for determining region later on + movsd r(%rsp), %xmm0 # x + movsd rr(%rsp), %xmm4 # xx + mov region(%rsp), %eax # region + +# xmm0 = x, xmm4 = xx, r8d = 1, eax= region +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +# perform taylor series to calc sinx, sinx +.Lsin_piby4: +# x2 = r * r; + +#xmm4 = a part of rr for the sin path, xmm4 is overwritten in the sin path +#instead use xmm3 because that was freed up in the sin path, xmm3 is overwritten in sin path + movsd %xmm0, %xmm3 + movsd %xmm0, %xmm2 + mulsd %xmm0, %xmm2 # x2 + +## if region is 0 or 2 do a sin calc. + and %eax, %r8d + jnz .Lcosregion + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# region 0 or 2 do a sin calculation + movsd .Lsinarray+0x50(%rip), %xmm3 # s6 + mulsd %xmm2, %xmm3 # x2s6 + movsd .Lsinarray+0x20(%rip), %xmm5 # s3 + movsd %xmm4,p_temp(%rsp) # store xx + movsd %xmm2, %xmm1 # move for x4 + mulsd %xmm2, %xmm1 # x4 + movsd %xmm0,p_temp1(%rsp) # store x + mulsd %xmm2, %xmm5 # x2s3 + movsd %xmm0, %xmm4 # move for x3 + addsd .Lsinarray+0x40(%rip), %xmm3 # s5+x2s6 + mulsd %xmm2, %xmm1 # x6 + mulsd %xmm2, %xmm3 # x2(s5+x2s6) + mulsd %xmm2, %xmm4 # x3 + addsd .Lsinarray+0x10(%rip), %xmm5 # s2+x2s3 + mulsd %xmm2, %xmm5 # x2(s2+x2s3) + addsd .Lsinarray+0x30(%rip), %xmm3 # s4 + x2(s5+x2s6) + mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x2 + movsd p_temp(%rsp), %xmm0 # load xx + mulsd %xmm1, %xmm3 # x6(s4 + x2(s5+x2s6)) + addsd .Lsinarray(%rip), %xmm5 # s1+x2(s2+x2s3) + mulsd %xmm0, %xmm2 # 0.5 * x2 *xx + addsd %xmm5, %xmm3 # zs + mulsd %xmm3, %xmm4 # *x3 + subsd %xmm2, %xmm4 # x3*zs - 0.5 * x2 *xx + addsd %xmm4, %xmm0 # +xx + addsd p_temp1(%rsp), %xmm0 # +x + jmp .Ladjust_region + +.align 16 +.Lcosregion: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# region 1 or 3 - do a cos calculation +# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6)))); + mulsd %xmm0, %xmm4 # x*xx + movsd .L__real_3fe0000000000000(%rip), %xmm5 + movsd .Lcosarray+0x50(%rip), %xmm1 # c6 + movsd .Lcosarray+0x20(%rip), %xmm0 # c3 + mulsd %xmm2, %xmm5 # r = 0.5 *x2 + movsd %xmm2, %xmm3 # copy of x2 + movsd %xmm4,p_temp(%rsp) # store x*xx + mulsd %xmm2, %xmm1 # c6*x2 + mulsd %xmm2, %xmm0 # c3*x2 + subsd .L__real_3ff0000000000000(%rip), %xmm5 # -t=r-1.0 ;trash r + mulsd %xmm2, %xmm3 # x4 + addsd .Lcosarray+0x40(%rip), %xmm1 # c5+x2c6 + addsd .Lcosarray+0x10(%rip), %xmm0 # c2+x2C3 + addsd .L__real_3ff0000000000000(%rip), %xmm5 # 1 + (-t) ;trash t + mulsd %xmm2, %xmm3 # x6 + mulsd %xmm2, %xmm1 # x2(c5+x2c6) + mulsd %xmm2, %xmm0 # x2(c2+x2C3) + movsd %xmm2, %xmm4 # copy of x2 + mulsd .L__real_3fe0000000000000(%rip), %xmm4 # r recalculate + addsd .Lcosarray+0x30(%rip), %xmm1 # c4 + x2(c5+x2c6) + addsd .Lcosarray(%rip), %xmm0 # c1+x2(c2+x2C3) + mulsd %xmm2, %xmm2 # x4 recalculate + subsd %xmm4, %xmm5 # (1 + (-t)) - r + mulsd %xmm3, %xmm1 # x6(c4 + x2(c5+x2c6)) + addsd %xmm1, %xmm0 # zc + subsd 
.L__real_3ff0000000000000(%rip), %xmm4   # t recalculated
+    subsd p_temp(%rsp), %xmm5     # ((1 + (-t)) - r) - x*xx
+    mulsd %xmm2, %xmm0            # x4 * zc
+    addsd %xmm5, %xmm0            # x4 * zc + ((1 + (-t)) - r - x*xx)
+    subsd %xmm4, %xmm0            # result - (-t)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.Ladjust_region:                  # positive or negative
+# switch (region)
+    shr $1, %eax
+    mov %eax, %ecx
+    and %r11d, %eax
+    not %ecx
+    not %r11d
+    and %r11d, %ecx
+    or %ecx, %eax
+    and $1, %eax
+    jnz .Lsin_cleanup
+
+## if the original region was 0 or 1 and the arg is negative, then we negate the result.
+## if the original region was 2 or 3 and the arg is positive, then we negate the result.
+    movsd %xmm0, %xmm2
+    xorpd %xmm0, %xmm0
+    subsd %xmm2, %xmm0
+
+.align 16
+.Lsin_cleanup:
+    add $stack_size, %rsp
+    ret
+
+.align 16
+.Lsin_naninf:
+    call fname_special
+    add $stack_size, %rsp
+    ret
+
+
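
The fast path above is classic Cody-Waite reduction: pi/2 is split across piby2_1 and piby2_1tail (with piby2_2/piby2_2tail for near multiples) so the remainder keeps bits that a single subtraction would lose. A minimal C sketch of the first-level reduction, using the decimal values of the constants above and ignoring the rr (low-word) bookkeeping the assembly also performs:

    /* Sketch only: valid while npi2 * piby2_1 stays exact, which is why
       the assembly limits this path to |x| < 5e6. */
    static double reduce_piby2(double x, int *region)
    {
        const double twobypi     = 6.36619772367581382433e-01; /* 2/pi      */
        const double piby2_1     = 1.57079632673412561417e+00; /* head      */
        const double piby2_1tail = 6.07710050650619224932e-11; /* tail      */
        int npi2 = (int)(x * twobypi + 0.5);
        double rhead = x - npi2 * piby2_1;
        double rtail = npi2 * piby2_1tail;
        *region = npi2 & 3;          /* which quadrant the remainder is in */
        return rhead - rtail;        /* r, accurate to well under 1 ulp */
    }

Because piby2_1 has its trailing mantissa bits zeroed, npi2 * piby2_1 is exact for the npi2 range this path admits; the tail word then restores the digits of pi/2 that the head omitted.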
diff --git a/src/gas/sincos.S b/src/gas/sincos.S new file mode 100644 index 0000000..6558f9e --- /dev/null +++ b/src/gas/sincos.S
@@ -0,0 +1,616 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sincos function.
+#
+# Prototype:
+#
+#     void sincos(double x, double* sinr, double* cosr);
+#
+#   Computes sin(x) and cos(x).
+#   It will provide proper C99 return values,
+#   but may not raise floating point status bits properly.
+#   Based on the NAG C implementation.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff # Sign bit zero
+ .quad 0 # for alignment
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0
+.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 0x0411E848000000000 # 5e5
+ .quad 0
+
+.align 16
+.Lsincosarray:
+    .quad 0x0bfc5555555555555 # -0.16666666666666666 s1
+    .quad 0x03fa5555555555555 # 0.041666666666666664 c1
+    .quad 0x03f81111111110bb3 # 0.00833333333333095 s2
+    .quad 0x0bf56c16c16c16967 # -0.0013888888888887398 c2
+    .quad 0x0bf2a01a019e83e5c # -0.00019841269836761127 s3
+    .quad 0x03efa01a019f4ec90 # 2.4801587298767041E-05 c3
+    .quad 0x03ec71de3796cde01 # 2.7557316103728802E-06 s4
+    .quad 0x0be927e4fa17f65f6 # -2.7557317272344188E-07 c4
+    .quad 0x0be5ae600b42fdfa7 # -2.5051132068021698E-08 s5
+    .quad 0x03e21eeb69037ab78 # 2.0876146382232963E-09 c5
+    .quad 0x03de5e0b2f9a43bb8 # 1.5918144304485914E-10 s6
+    .quad 0x0bda907db46cc5e42 # -1.1382639806794487E-11 c6
+
+.align 16
+.Lcossinarray:
+    .quad 0x03fa5555555555555 # 0.0416667 c1
+    .quad 0x0bfc5555555555555 # -0.166667 s1
+    .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+    .quad 0x03f81111111110bb3 # 0.00833333 s2
+    .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+    .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+    .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+    .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+    .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+    .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+    .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+    .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+.text
+.align 16
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sincos)
+#define fname_special _sincos_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30      # temporary for get/put bits operation
+.equ p_temp1, 0x40     # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2 +.equ rr, 0x60 # pointer to rr for amd_remainder_piby2 +.equ region, 0x70 # pointer to region for amd_remainder_piby2 +.equ stack_size, 0x98 + +.globl fname +.type fname,@function + +fname: + sub $stack_size, %rsp + xorpd %xmm2,%xmm2 # zeroed out for later use + +# GET_BITS_DP64(x, ux); +# get the input value to an integer register. + movsd %xmm0,p_temp(%rsp) + mov p_temp(%rsp),%rcx # rcx is ux + +## if NaN or inf + mov $0x07ff0000000000000,%rax + mov %rax,%r10 + and %rcx,%r10 + cmp %rax,%r10 + jz .Lsincos_naninf + +# ax = (ux & ~SIGNBIT_DP64); + mov $0x07fffffffffffffff,%r10 + and %rcx,%r10 # r10 is ax + +## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */ + mov $0x03fe921fb54442d18,%rax + cmp %rax,%r10 + jg .Lsincos_reduce + +## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */ + mov $0x03f20000000000000,%rax + cmp %rax,%r10 + jge .Lsincos_small + +## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */ + mov $0x03e40000000000000,%rax + cmp %rax,%r10 + jge .Lsincos_smaller + + # sin = x; + movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos = 1.0; + jmp .Lsincos_cleanup + +## else +.align 32 +.Lsincos_smaller: +# sin = x - x^3 * 0.1666666666666666666; +# cos = 1.0 - x*x*0.5; + + movsd %xmm0,%xmm2 + movsd .L__real_3fc5555555555555(%rip),%xmm4 # 0.1666666666666666666 + mulsd %xmm2,%xmm2 # x^2 + movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0 + movsd %xmm2,%xmm3 # copy of x^2 + + mulsd %xmm0,%xmm2 # x^3 + mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 * x^2 + mulsd %xmm4,%xmm2 # x^3 * 0.1666666666666666666 + subsd %xmm2,%xmm0 # x - x^3 * 0.1666666666666666666, sin + subsd %xmm3,%xmm1 # 1 - 0.5 * x^2, cos + + jmp .Lsincos_cleanup + + +## else + +.align 16 +.Lsincos_small: +# sin = sin_piby4(x, 0.0); + movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5 + +# x2 = r * r; + movsd %xmm0,%xmm2 + mulsd %xmm0,%xmm2 # x2 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# region 0 or 2 - do a sin calculation +# zs = (s2 + x2 * (s3 + x2 * (s4 + x2 * (s5 + x2 * s6)))); + + movlhps %xmm2,%xmm2 + movapd .Lsincosarray+0x50(%rip),%xmm3 # s6 + movapd %xmm2,%xmm1 # move for x4 + movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3 + mulpd %xmm2,%xmm3 # x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6 + mulpd %xmm2,%xmm5 # x2s3 + movapd %xmm4,p_temp(%rsp) # rr move to to memory + mulpd %xmm2,%xmm1 # x4 + mulpd %xmm2,%xmm3 # x2(s5+x2s6) + addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3 + movapd %xmm1,%xmm4 # move for x6 + addpd .Lsincosarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6) + mulpd %xmm2,%xmm5 # x2(s2+x2s3) + mulpd %xmm2,%xmm4 # x6 + addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3) + mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6)) + + movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 for cos + # xmm2 contains x2 for x3 for sin + addpd %xmm5,%xmm3 # zs in lower and zc upper + + mulsd %xmm0,%xmm2 # xmm2=x3 for sin + + movhlps %xmm3,%xmm5 # Copy z, xmm5 = cos , xmm3 = sin + + mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term + mulsd %xmm2,%xmm3 # sin *x3 + mulsd %xmm1,%xmm5 # cos *x4 + movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0 + subsd %xmm4,%xmm2 # t=1.0-r + movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0 + subsd %xmm2,%xmm1 # 1 - t + subsd %xmm4,%xmm1 # (1-t) -r + addsd %xmm5,%xmm1 # ((1-t) -r) + cos + addsd %xmm3,%xmm0 # xmm0= sin+x, final sin term + addsd %xmm2,%xmm1 # xmm1 = t +{ ((1-t) -r) + cos}, final cos term + + jmp .Lsincos_cleanup 
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsincos_reduce: +# change rdx to rcx and r8 to r9 +# rcx= ux, r10 = ax +# %r9,%rax are free + +# xneg = (ax != ux); + cmp %r10,%rcx + mov $0,%r11d + +## if (xneg) x = -x; + jz .LPositive + mov $1,%r11d + subsd %xmm0,%xmm2 + movsd %xmm2,%xmm0 + +# rcx= ux, r10 = ax, r11= Sign +# %r9,%rax are free +# change rdx to rcx and r8 to r9 + +.align 16 +.LPositive: +## if (x < 5.0e5) + cmp .L__real_411E848000000000(%rip),%r10 + jae .Lsincos_reduce_precise + +# reduce the argument to be in a range from -pi/4 to +pi/4 +# by subtracting multiples of pi/2 + movsd %xmm0,%xmm2 + movsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # twobypi + movsd %xmm0,%xmm4 + movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5 + mulsd %xmm3,%xmm2 + +#/* How many pi/2 is x a multiple of? */ +# xexp = ax >> EXPSHIFTBITS_DP64; + shr $52,%r10 # >>EXPSHIFTBITS_DP64 + +# npi2 = (int)(x * twobypi + 0.5); + addsd %xmm5,%xmm2 # npi2 + + movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1 + cvttpd2dq %xmm2,%xmm0 # convert to integer + movsd .L__real_3dd0b4611a626331(%rip),%xmm1 # piby2_1tail + cvtdq2pd %xmm0,%xmm2 # and back to float. + +# /* Subtract the multiple from x to get an extra-precision remainder */ +# rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm3 + subsd %xmm3,%xmm4 # rhead + +# rtail = npi2 * piby2_1tail; + mulsd %xmm2,%xmm1 + movd %xmm0,%eax + + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead + movsd %xmm4,%xmm0 + subsd %xmm1,%xmm0 + + movsd .L__real_3dd0b4611a600000(%rip),%xmm3 # piby2_2 + movsd %xmm0,p_temp(%rsp) + movsd .L__real_3ba3198a2e037073(%rip),%xmm5 # piby2_2tail + mov %eax,%ecx + mov p_temp(%rsp),%r9 # rcx is rhead-rtail + +# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc +# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64); + shl $1,%r9 # strip any sign bit + shr $53,%r9 # >> EXPSHIFTBITS_DP64 +1 + sub %r9,%r10 # expdiff + +## if (expdiff > 15) + cmp $15,%r10 + jle .Lexpdiff15 + +# /* The remainder is pretty small compared with x, which +# implies that x is a near multiple of pi/2 +# (x matches the multiple to at least 15 bits) */ + +# t = rhead; + movsd %xmm4,%xmm1 + +# rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm3 + +# rhead = t - rtail; + mulsd %xmm2,%xmm5 # npi2 * piby2_2tail + subsd %xmm3,%xmm4 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + subsd %xmm4,%xmm1 # t - rhead + subsd %xmm3,%xmm1 # -rtail + subsd %xmm1,%xmm5 # rtail + +# r = rhead - rtail; + movsd %xmm4,%xmm0 + +#HARSHA +#xmm1=rtail + movsd %xmm5,%xmm1 + subsd %xmm5,%xmm0 + +# xmm0=r, xmm4=rhead, xmm1=rtail +.Lexpdiff15: +# region = npi2 & 3; + + subsd %xmm0,%xmm4 # rhead-r + subsd %xmm1,%xmm4 # rr = (rhead-r) - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +## if the input was close to a pi/2 multiple +# The original NAG code missed this trick. If the input is very close to n*pi/2 after +# reduction, +# then the sin is ~ 1.0 , to within 53 bits, when r is < 2^-27. We already +# have x at this point, so we can skip the sin polynomials. + + cmp $0x03f2,%r9 # if r small. + jge .Lcossin_piby4 # use taylor series if not + cmp $0x03de,%r9 # if r really small. + jle .Lr_small # then sin(r) = 0 + + movsd %xmm0,%xmm2 + mulsd %xmm2,%xmm2 # x^2 + +## if region is 0 or 2 do a sin calc. 
+ and $1,%ecx + jnz .Lregion13 + +# region 0 or 2 do a sincos calculation +# use simply polynomial +# sin=x - x*x*x*0.166666666666666666; + movsd .L__real_3fc5555555555555(%rip),%xmm3 # 0.166666666 + mulsd %xmm0,%xmm3 # * x + mulsd %xmm2,%xmm3 # * x^2 + subsd %xmm3,%xmm0 # xs +# cos=1.0 - x*x*0.5; + movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0 + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2 + subsd %xmm2,%xmm1 # xc + + jmp .Ladjust_region + +.align 16 +.Lregion13: +# region 1 or 3 do a cossin calculation +# use simply polynomial +# sin=x - x*x*x*0.166666666666666666; + movsd %xmm0,%xmm1 + + movsd .L__real_3fc5555555555555(%rip),%xmm3 # 0.166666666 + mulsd %xmm0,%xmm3 # 0.166666666* x + mulsd %xmm2,%xmm3 # 0.166666666* x * x^2 + subsd %xmm3,%xmm1 # xs +# cos=1.0 - x*x*0.5; + movsd .L__real_3ff0000000000000(%rip),%xmm0 # 1.0 + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2 + subsd %xmm2,%xmm0 # xc + + jmp .Ladjust_region + +.align 16 +.Lr_small: +## if region is 0 or 2 do a sincos calc. + movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos(r) is a 1 + and $1,%ecx + jz .Ladjust_region + +## if region is 1 or 3 do a cossin calc. + movsd %xmm0,%xmm1 # sin(r) is r + movsd .L__real_3ff0000000000000(%rip),%xmm0 # cos(r) is a 1 + jmp .Ladjust_region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsincos_reduce_precise: +# // Reduce x into range [-pi/4,pi/4] +# __amd_remainder_piby2(x, &r, &rr, ®ion); + + mov %rdi, p_temp1(%rsp) + mov %rsi, p_temp1+8(%rsp) + mov %r11,p_temp(%rsp) + + lea region(%rsp),%rdx + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + + call __amd_remainder_piby2@PLT + + mov p_temp1(%rsp), %rdi + mov p_temp1+8(%rsp), %rsi + mov p_temp(%rsp),%r11 + + movsd r(%rsp),%xmm0 # x + movsd rr(%rsp),%xmm4 # xx + mov region(%rsp),%eax # region to classify for sin/cos calc + mov %eax,%ecx # region to get sign + +# xmm0 = x, xmm4 = xx, r8d = 1, eax= region +.align 16 +.Lcossin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# perform taylor series to calc sinx, sinx +# x2 = r * r; +#xmm4 = a part of rr for the sin path, xmm4 is overwritten in the sin path +#instead use xmm3 because that was freed up in the sin path, xmm3 is overwritten in sin path + + movsd %xmm0,%xmm2 + mulsd %xmm0,%xmm2 #x2 + +## if region is 0 or 2 do a sincos calc. 
+ and $1,%ecx + jz .Lsincos02 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# region 1 or 3 - do a cossin calculation +# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6)))); + + + movlhps %xmm2,%xmm2 + + movapd .Lcossinarray+0x50(%rip),%xmm3 # s6 + movapd %xmm2,%xmm1 # move for x4 + movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3 + mulpd %xmm2,%xmm3 # x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6 + mulpd %xmm2,%xmm5 # x2s3 + movsd %xmm4,p_temp(%rsp) # rr move to to memory + mulpd %xmm2,%xmm1 # x4 + mulpd %xmm2,%xmm3 # x2(s5+x2s6) + addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3 + movapd %xmm1,%xmm4 # move for x6 + addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6) + mulpd %xmm2,%xmm5 # x2(s2+x2s3) + mulpd %xmm2,%xmm4 # x6 + addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3) + mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6)) + + movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 cos + # xmm2 contains x2 for x3 sin + + addpd %xmm5,%xmm3 # zc in lower and zs in upper + + mulsd %xmm0,%xmm2 # xmm2=x3 for the sin term + + movhlps %xmm3,%xmm5 # Copy z, xmm5 = sin, xmm3 = cos + mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term + + mulsd %xmm2,%xmm5 # sin *x3 + mulsd %xmm1,%xmm3 # cos *x4 + movsd %xmm0,p_temp1(%rsp) # store x + movsd %xmm0,%xmm1 + + movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0 + subsd %xmm4,%xmm2 # t=1.0-r + + movsd .L__real_3ff0000000000000(%rip),%xmm0 # 1.0 + subsd %xmm2,%xmm0 # 1 - t + + mulsd p_temp(%rsp),%xmm1 # x*xx + subsd %xmm4,%xmm0 # (1-t) -r + subsd %xmm1,%xmm0 # ((1-t) -r) - x *xx + + mulsd p_temp(%rsp),%xmm4 # 0.5*x2*xx + + addsd %xmm3,%xmm0 # (((1-t) -r) - x *xx) + cos + + subsd %xmm4,%xmm5 # sin - 0.5*x2*xx + + addsd %xmm2,%xmm0 # xmm0 = t +{ (((1-t) -r) - x *xx) + cos}, final cos term + + addsd p_temp(%rsp),%xmm5 # sin + xx + movsd p_temp1(%rsp),%xmm1 # load x + addsd %xmm5,%xmm1 # xmm1= sin+x, final sin term + + jmp .Ladjust_region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsincos02: +# region 0 or 2 do a sincos calculation + movlhps %xmm2,%xmm2 + + movapd .Lsincosarray+0x50(%rip),%xmm3 # s6 + movapd %xmm2,%xmm1 # move for x4 + movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3 + mulpd %xmm2,%xmm3 # x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6 + mulpd %xmm2,%xmm5 # x2s3 + movsd %xmm4,p_temp(%rsp) # rr move to to memory + mulpd %xmm2,%xmm1 # x4 + mulpd %xmm2,%xmm3 # x2(s5+x2s6) + addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3 + movapd %xmm1,%xmm4 # move for x6 + addpd .Lsincosarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6) + mulpd %xmm2,%xmm5 # x2(s2+x2s3) + mulpd %xmm2,%xmm4 # x6 + addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3) + mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6)) + + movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 for cos + # xmm2 contains x2 for x3 for sin + + addpd %xmm5,%xmm3 # zs in lower and zc in upper + + mulsd %xmm0,%xmm2 # xmm2=x3 for sin + + movhlps %xmm3,%xmm5 # Copy z, xmm5 = cos , xmm3 = sin + + mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term + + mulsd %xmm2,%xmm3 # sin *x3 + mulsd %xmm1,%xmm5 # cos *x4 + + movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0 + subsd %xmm4,%xmm2 # t=1.0-r + + movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0 + subsd %xmm2,%xmm1 # 1 - t + + movsd %xmm0,p_temp1(%rsp) # store x + mulsd p_temp(%rsp),%xmm0 # x*xx + + subsd %xmm4,%xmm1 # (1-t) -r + subsd %xmm0,%xmm1 # ((1-t) -r) - x *xx + + mulsd p_temp(%rsp),%xmm4 # 0.5*x2*xx + + addsd %xmm5,%xmm1 # (((1-t) -r) - x *xx) + cos + + subsd 
%xmm4,%xmm3 # sin - 0.5*x2*xx + + addsd %xmm2,%xmm1 # xmm1 = t +{ (((1-t) -r) - x *xx) + cos}, final cos term + + addsd p_temp(%rsp),%xmm3 # sin + xx + movsd p_temp1(%rsp),%xmm0 # load x + addsd %xmm3,%xmm0 # xmm0= sin+x, final sin term + + jmp .Ladjust_region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# switch (region) +.align 16 +.Ladjust_region: # positive or negative for sin return val in xmm0 + + mov %eax,%r9d + + shr $1,%eax + mov %eax,%ecx + and %r11d,%eax + + not %ecx + not %r11d + and %r11d,%ecx + + or %ecx,%eax + and $1,%eax + jnz .Lcos_sign + +## if the original region 0, 1 and arg is negative, then we negate the result. +## if the original region 2, 3 and arg is positive, then we negate the result. + movsd %xmm0,%xmm2 + xorpd %xmm0,%xmm0 + subsd %xmm2,%xmm0 + +.Lcos_sign: # positive or negative for cos return val in xmm1 + add $1,%r9 + and $2,%r9d + jz .Lsincos_cleanup +## if the original region 1 or 2 then we negate the result. + movsd %xmm1,%xmm2 + xorpd %xmm1,%xmm1 + subsd %xmm2,%xmm1 + +#.align 16 +.Lsincos_cleanup: + movsd %xmm0, (%rdi) # save the sin + movsd %xmm1, (%rsi) # save the cos + + add $stack_size,%rsp + ret + +.align 16 +.Lsincos_naninf: + call fname_special + add $stack_size, %rsp + ret +
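
sincos gains its speed by broadcasting x2 into both halves of an XMM register (the movlhps) and running a single mulpd/addpd chain over the interleaved coefficient table, so the sin and cos series are evaluated together. A scalar C model of what that packed chain computes, with the coefficient values copied from the table comments; r is the reduced argument and rr its low-order part:

    /* Sketch of the polynomial core; not the shipped source. */
    static void sincos_poly(double r, double rr, double *s, double *c)
    {
        static const double S[6] = {
            -0.16666666666666666,     0.00833333333333095,
            -0.00019841269836761127,  2.7557316103728802e-06,
            -2.5051132068021698e-08,  1.5918144304485914e-10 };
        static const double C[6] = {
             0.041666666666666664,   -0.0013888888888887398,
             2.4801587298767041e-05, -2.7557317272344188e-07,
             2.0876146382232963e-09, -1.1382639806794487e-11 };
        double x2 = r * r, x3 = x2 * r, x4 = x2 * x2;
        double zs = S[0] + x2*(S[1] + x2*(S[2] + x2*(S[3] + x2*(S[4] + x2*S[5]))));
        double zc = C[0] + x2*(C[1] + x2*(C[2] + x2*(C[3] + x2*(C[4] + x2*C[5]))));
        double h  = 0.5 * x2;              /* the 'r' of the cos-term comments */
        double t  = 1.0 - h;
        *s = r + (rr - h * rr + x3 * zs);
        *c = t + (((1.0 - t) - h) - r * rr + x4 * zc);
    }

The (1.0 - t) - h term recovers the rounding error of t = 1 - h, which is why the cos result stays accurate even though 1 - x2/2 alone would cancel badly for the larger reduced arguments.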
diff --git a/src/gas/sincosf.S b/src/gas/sincosf.S new file mode 100644 index 0000000..dcdbe9a --- /dev/null +++ b/src/gas/sincosf.S
@@ -0,0 +1,402 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sincosf function.
+#
+# Prototype:
+#
+#     void sincosf(float x, float * sinfx, float * cosfx);
+#
+#   Computes sinf(x) and cosf(x).
+#   It will provide proper C99 return values,
+#   but may not raise floating point status bits properly.
+#   Based on the NAG C implementation.
+#   Author: Harsha Jagasia
+#   Email: harsha.jagasia@amd.com
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3FF921FB54442D18: .quad 0x03FF921FB54442D18 # piby2
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0
+.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 0x0411E848000000000 # 5e5
+ .quad 0
+
+.align 32
+.Lcsarray:
+    .quad 0x0bfc5555555555555 # -0.166667 s1
+    .quad 0x03fa5555555555555 # 0.0416667 c1
+    .quad 0x03f81111111110bb3 # 0.00833333 s2
+    .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+    .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+    .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+    .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+    .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+    .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+    .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+    .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+    .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+
+.text
+.align 16
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sincosf)
+#define fname_special _sincosf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30      # temporary for get/put bits operation
+.equ p_temp1, 0x40     # temporary for get/put bits operation
+.equ p_temp2, 0x50     # temporary for get/put bits operation
+.equ p_temp3, 0x60     # temporary for get/put bits operation
+.equ region, 0x70      # pointer to region for amd_remainder_piby2
+.equ r, 0x80           # pointer to r for amd_remainder_piby2
+.equ stack_size, 0xa8
+
+.globl fname
+.type fname,@function
+
+fname:
+    sub $stack_size, %rsp
+
+    xorpd %xmm2,%xmm2
+
+# GET_BITS_DP64(x, ux);
+# convert input to double.
cvtss2sd %xmm0,%xmm0
+# get the input value to an integer register.
+    movsd %xmm0,p_temp(%rsp)
+    mov p_temp(%rsp),%rdx     # rdx is ux
+
+## if NaN or inf
+    mov $0x07ff0000000000000,%rax
+    mov %rax,%r10
+    and %rdx,%r10
+    cmp %rax,%r10
+    jz .L__sc_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+    mov $0x07fffffffffffffff,%r10
+    and %rdx,%r10             # r10 is ax
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+    mov $0x03fe921fb54442d18,%rax
+    cmp %rax,%r10
+    jg .L__sc_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+    mov $0x3f20000000000000, %rax
+    cmp %rax, %r10
+    jge .L__sc_notsmallest
+
+# sinf = x, cosf = 1.0
+    movsd .L__real_3ff0000000000000(%rip),%xmm1
+    jmp .L__sc_cleanup
+
+# *s = sin_piby4(x, 0.0);
+# *c = cos_piby4(x, 0.0);
+.L__sc_notsmallest:
+    xor %eax,%eax             # region 0
+    mov %r10,%rdx
+    movsd .L__real_3fe0000000000000(%rip),%xmm5    # .5
+    jmp .L__sc_piby4
+
+.L__sc_reduce:
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+
+# xneg = (ax != ux);
+    cmp %r10,%rdx
+## if (xneg) x = -x;
+    jz .Lpositive
+    subsd %xmm0,%xmm2
+    movsd %xmm2,%xmm0
+
+.align 16
+.Lpositive:
+## if (x < 5.0e6)
+    cmp .L__real_411E848000000000(%rip),%r10
+    jae .Lsincosf_reduce_precise
+
+    movsd %xmm0,%xmm2
+    movsd %xmm0,%xmm4
+
+    mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2    # twobypi
+    movsd .L__real_3fe0000000000000(%rip),%xmm5    # .5
+
+#/* How many pi/2 is x a multiple of? */
+#  xexp = ax >> EXPSHIFTBITS_DP64;
+    mov %r10,%r9
+    shr $52,%r9               # >> EXPSHIFTBITS_DP64
+
+#  npi2 = (int)(x * twobypi + 0.5);
+    addsd %xmm5,%xmm2         # npi2
+
+    movsd .L__real_3ff921fb54400000(%rip),%xmm3    # piby2_1
+    cvttpd2dq %xmm2,%xmm0     # convert to integer
+    movsd .L__real_3dd0b4611a626331(%rip),%xmm1    # piby2_1tail
+    cvtdq2pd %xmm0,%xmm2      # and back to float.
+
+#  /* Subtract the multiple from x to get an extra-precision remainder */
+#  rhead = x - npi2 * piby2_1;
+    mulsd %xmm2,%xmm3         # use piby2_1
+    subsd %xmm3,%xmm4         # rhead
+
+#  rtail = npi2 * piby2_1tail;
+    mulsd %xmm2,%xmm1         # rtail
+
+    movd %xmm0,%eax
+
+# GET_BITS_DP64(rhead-rtail, uy);    ; originally only rhead
+    movsd %xmm4,%xmm0
+    subsd %xmm1,%xmm0
+
+    movsd .L__real_3dd0b4611a600000(%rip),%xmm3    # piby2_2
+    movsd .L__real_3ba3198a2e037073(%rip),%xmm5    # piby2_2tail
+    movd %xmm0,%rcx
+
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+    shl $1,%rcx               # strip any sign bit
+    shr $53,%rcx              # >> EXPSHIFTBITS_DP64 +1
+    sub %rcx,%r9              # expdiff
+
+## if (expdiff > 15)
+    cmp $15,%r9
+    jle .Lexpdiff15
+
+#  /* The remainder is pretty small compared with x, which
+#     implies that x is a near multiple of pi/2
+#     (x matches the multiple to at least 15 bits) */
+
+#  t = rhead;
+    movsd %xmm4,%xmm1
+
+#  rtail = npi2 * piby2_2;
+    mulsd %xmm2,%xmm3
+
+#  rhead = t - rtail;
+    mulsd %xmm2,%xmm5         # npi2 * piby2_2tail
+    subsd %xmm3,%xmm4         # rhead
+
+#  rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+    subsd %xmm4,%xmm1         # t - rhead
+    subsd %xmm3,%xmm1         # -rtail
+    subsd %xmm1,%xmm5         # rtail
+
+#  r = rhead - rtail;
+    movsd %xmm4,%xmm0
+
+#HARSHA
+#xmm1=rtail
+    movsd %xmm5,%xmm1
+    subsd %xmm5,%xmm0
+
+# region = npi2 & 3;
+#    and $3,%eax
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiff15:
+
+## if the input was close to a pi/2 multiple
+
+    cmp $0x03f2,%rcx          # if r small.
+    jge .L__sc_piby4          # use taylor series if not
+    cmp $0x03de,%rcx          # if r really small.
+ jle .Lsinsmall # then sin(r) = r
+
+ movsd %xmm0,%xmm2
+ mulsd %xmm2,%xmm2 # x^2
+# use simple polynomial
+# *s = x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip),%xmm3 #
+ mulsd %xmm0,%xmm3 # * x
+ mulsd %xmm2,%xmm3 # * x^2
+ subsd %xmm3,%xmm0 # xs
+
+# *c = 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2
+ subsd %xmm2,%xmm1
+ jmp .L__adjust_region
+
+.Lsinsmall: # then sin(r) = r
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos(r) is a 1
+ jmp .L__adjust_region
+
+# perform taylor series to calc sinx, cosx
+# COS
+# x2 = x * x;
+# return (1.0 - 0.5 * x2 + (x2 * x2 *
+# (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)))));
+# SIN
+# zc,zs = (c2 + x2 * (c3 + x2 * c4 ));
+# xs = r + x3 * (sc1 + x2 * zs);
+# x2 = x * x;
+# return (x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+# done with reducing the argument. Now perform the sin/cos calculations.
+.align 16
+.L__sc_piby4:
+# x2 = r * r;
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+ movsd %xmm0,%xmm2
+ mulsd %xmm0,%xmm2 # x2
+ shufpd $0,%xmm2,%xmm2 # x2,x2
+ movsd %xmm2,%xmm4
+ mulsd %xmm4,%xmm4 # x4
+ shufpd $0,%xmm4,%xmm4 # x4,x4
+
+# x2m = _mm_set1_pd (x2);
+# zc,zs = (c2 + x2 * (c3 + x2 * c4 ));
+# xs = r + x3 * (sc1 + x2 * zs);
+# xc = t + ( x2 * x2 * (cc1 + x2 * zc));
+ movapd .Lcsarray+0x30(%rip),%xmm1 # c4
+ movapd .Lcsarray+0x10(%rip),%xmm3 # c2
+ mulpd %xmm2,%xmm1 # x2c4
+ mulpd %xmm2,%xmm3 # x2c2
+
+# rc = 0.5 * x2;
+ mulsd %xmm2,%xmm5 #rc
+ mulsd %xmm0,%xmm2 #x3
+
+ addpd .Lcsarray+0x20(%rip),%xmm1 # c3 + x2c4
+ addpd .Lcsarray(%rip),%xmm3 # c1 + x2c2
+ mulpd %xmm4,%xmm1 # x4(c3 + x2c4)
+ addpd %xmm3,%xmm1 # c1 + x2c2 + x4(c3 + x2c4)
+
+# -t = rc-1;
+ subsd .L__real_3ff0000000000000(%rip),%xmm5 # 1.0
+# now we have the poly for sin in the low half, and cos in upper half
+ mulsd %xmm1,%xmm2 # x3(sin poly)
+ shufpd $3,%xmm1,%xmm1 # get cos poly to low half of register
+ mulsd %xmm4,%xmm1 # x4(cos poly)
+
+ addsd %xmm2,%xmm0 # sin = r+...
+ subsd %xmm5,%xmm1 # cos = poly-(-t)
+
+.L__adjust_region: # xmm0 is sin, xmm1 is cos
+# switch (region)
+ mov %eax,%ecx
+ and $1,%eax
+ jz .Lregion02
+# region 1 or 3
+ movsd %xmm0,%xmm2 # swap sin,cos
+ movsd %xmm1,%xmm0 # sin = cos
+ xorpd %xmm1,%xmm1
+ subsd %xmm2,%xmm1 # cos = -sin
+
+.Lregion02:
+ and $2,%ecx
+ jz .Lregion23
+# region 2 or 3
+ movsd %xmm0,%xmm2
+ movsd %xmm1,%xmm3
+ xorpd %xmm0,%xmm0
+ xorpd %xmm1,%xmm1
+ subsd %xmm2,%xmm0 # sin = -sin
+ subsd %xmm3,%xmm1 # cos = -cos
+
+.Lregion23:
+## if (xneg) *s = -*s ;
+ cmp %r10,%rdx
+ jz .L__sc_cleanup
+ movsd %xmm0,%xmm2
+ xorpd %xmm0,%xmm0
+ subsd %xmm2,%xmm0 # sin = -sin
+
+.align 16
+.L__sc_cleanup:
+ cvtsd2ss %xmm0,%xmm0 # convert back to floats
+ cvtsd2ss %xmm1,%xmm1
+
+ movss %xmm0,(%rdi) # save the sin
+ movss %xmm1,(%rsi) # save the cos
+
+ add $stack_size,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincosf_reduce_precise:
+# /* Reduce abs(x) into range [-pi/4,pi/4] */
+# __amd_remainder_piby2d2f(x, &r, &region);
+
+ mov %rdx,p_temp(%rsp) # save ux for use later
+ mov %r10,p_temp1(%rsp) # save ax for use later
+ mov %rdi,p_temp2(%rsp) # save the sin result pointer for use later
+ mov %rsi,p_temp3(%rsp) # save the cos result pointer for use later
+ movd %xmm0,%rdi
+ lea r(%rsp),%rsi
+ lea region(%rsp),%rdx
+ sub $0x040,%rsp
+
+ call __amd_remainder_piby2d2f@PLT
+
+ add $0x040,%rsp
+ mov p_temp(%rsp),%rdx # restore ux for use later
+ mov p_temp1(%rsp),%r10 # restore ax for use later
+ mov p_temp2(%rsp),%rdi # restore the sin result pointer
+ mov p_temp3(%rsp),%rsi # restore the cos result pointer
+
+ mov $1,%r8d # for determining region later on
+ movsd r(%rsp),%xmm0 # r
+ mov region(%rsp),%eax # region
+ jmp .L__sc_piby4
+
+.align 16
+.L__sc_naninf:
+ cvtsd2ss %xmm0,%xmm0 # convert back to floats
+ call fname_special # rdi and rsi are ready for the function call
+ add $stack_size, %rsp
+ ret
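
For readers tracing the routine above, the moderate-argument path condenses to a few lines of scalar C. The sketch below is illustrative only, with invented names: library sin/cos stand in for the packed .Lcsarray polynomials, and a plain double reduction stands in for the extra-precision rhead/rtail arithmetic and the __amd_remainder_piby2d2f fallback for huge inputs.

#include <math.h>

/* Illustrative sketch (not the shipped code) of sincosf's main path:
   reduce by the nearest multiple of pi/2, then pick and sign the
   sin/cos of the remainder by region = npi2 & 3. */
static void sincosf_sketch(float x, float *sinfx, float *cosfx)
{
    const double twobypi  = 6.36619772367581382433e-01;  /* 0x3fe45f306dc9c883 */
    const double piby2_1  = 1.57079632673412561417e+00;  /* 0x3ff921fb54400000 */
    const double piby2_1t = 6.07710050650619224932e-11;  /* 0x3dd0b4611a626331 */

    double ax = fabs((double)x);
    int npi2 = (int)(ax * twobypi + 0.5);            /* nearest multiple of pi/2 */
    double r = (ax - npi2 * piby2_1) - npi2 * piby2_1t;
    int region = npi2 & 3;

    double s = sin(r), c = cos(r);                   /* stand-ins for .L__sc_piby4 */
    if (region & 1) { double t = s; s = c; c = -t; } /* regions 1,3: swap, negate cos */
    if (region & 2) { s = -s; c = -c; }              /* regions 2,3: negate both */
    if (x < 0.0f) s = -s;                            /* sin is odd, cos is even */

    *sinfx = (float)s;
    *cosfx = (float)c;
}

The region logic is exactly the .L__adjust_region block: odd regions swap the pair and negate the new cosine, regions 2 and 3 negate both, and a negative input flips only the sine.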
diff --git a/src/gas/sinf.S b/src/gas/sinf.S new file mode 100644 index 0000000..c2083ff --- /dev/null +++ b/src/gas/sinf.S
@@ -0,0 +1,436 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sinf function.
+#
+# Prototype:
+#
+# float sinf(float x);
+#
+# Computes sinf(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_411E848000000000: .quad 0x415312d000000000 # 5.0e6; label keeps the name of the old 5.0e5 cutoff (0x0411E848000000000)
+ .quad 0
+
+.align 32
+.Lcosfarray:
+ .quad 0x0bfe0000000000000 # -0.5 c0
+ .quad 0
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0
+ .quad 0x0bf56c16c16c16c16 # -0.00138889 c2
+ .quad 0
+ .quad 0x03EFA01A01A01A019 # 2.48016e-005 c3
+ .quad 0
+ .quad 0x0be927e4fb7789f5c # -2.75573e-007 c4
+ .quad 0
+
+.align 32
+.Lsinfarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0
+ .quad 0x03f81111111111111 # 0.00833333 s2
+ .quad 0
+ .quad 0x0bf2a01a01a01a01a # -0.000198413 s3
+ .quad 0
+ .quad 0x03ec71de3a556c734 # 2.75573e-006 s4
+ .quad 0
+
+.text
+.align 32
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sinf)
+#define fname_special _sinf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ region, 0x60 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x88
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2, %xmm2 # zeroed out for later use
+
+## if NaN or inf
+ movd %xmm0, %edx
+ mov $0x07f800000, %eax
+ mov %eax, %r10d
+ and %edx, %r10d
+ cmp %eax, %r10d
+ jz .Lsinf_naninf
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ cvtss2sd %xmm0, %xmm0 # convert input to double.
+ movsd %xmm0,p_temp(%rsp) # get the input value to an integer register.
+ + mov p_temp(%rsp), %rdx # rdx is ux + +# ax = (ux & ~SIGNBIT_DP64); + mov $0x07fffffffffffffff, %r10 + and %rdx, %r10 # r10 is ax + mov $1, %r8d # for determining region later on + +## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */ + mov $0x03fe921fb54442d18, %rax + cmp %rax, %r10 + jg .Lsinf_reduce + +## if (ax < 0x3f80000000000000) /* abs(x) < 2.0^(-7) */ + mov $0x3f80000000000000, %rax + cmp %rax, %r10 + jge .Lsinf_small + +## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */ + mov $0x3f20000000000000, %rax + cmp %rax, %r10 + jge .Lsinf_smaller + +# sinf = x; + jmp .Lsinf_cleanup # done + +## else + +.Lsinf_smaller: +# sinf = x - x^3 * 0.1666666666666666666; + movsd %xmm0, %xmm2 + movsd .L__real_3fc5555555555555(%rip), %xmm4 # 0.1666666666666666666 + mulsd %xmm2, %xmm2 # x^2 + mulsd %xmm0, %xmm2 # x^3 + mulsd %xmm4, %xmm2 # x^3 * 0.1666666666666666666 + subsd %xmm2, %xmm0 # x - x^3 * 0.1666666666666666666 + jmp .Lsinf_cleanup + +.Lsinf_small: + movsd %xmm0, %xmm2 + mulsd %xmm0, %xmm2 # x2 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# region 0 or 2 - do a sinf calculation +# zs = x + x3((s1 + x2 * s2) + x4(s3 + x2 * s4)); + movsd .Lsinfarray+0x30(%rip), %xmm1 # s4 + mulsd %xmm2, %xmm1 # s4x2 + movsd %xmm2, %xmm4 # move for x4 + movsd .Lsinfarray+0x10(%rip), %xmm5 # s2 + mulsd %xmm2, %xmm4 # x4 + movsd %xmm0, %xmm3 # move for x3 + mulsd %xmm2, %xmm5 # s2x2 + mulsd %xmm2, %xmm3 # x3 + addsd .Lsinfarray+0x20(%rip), %xmm1 # s3+s4x2 + mulsd %xmm4, %xmm1 # s3x4+s4x6 + addsd .Lsinfarray(%rip), %xmm5 # s1+s2x2 + addsd %xmm5, %xmm1 # s1+s2x2+s3x4+s4x6 + mulsd %xmm3, %xmm1 # x3(s1+s2x2+s3x4+s4x6) + addsd %xmm1, %xmm0 # x + x3(s1+s2x2+s3x4+s4x6) + jmp .Lsinf_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 32 +.Lsinf_reduce: + +# xneg = (ax != ux); + cmp %r10, %rdx + mov $0, %r11d + +## if (xneg) x = -x; + jz .L50e5 + mov $1, %r11d + subsd %xmm0, %xmm2 + movsd %xmm2, %xmm0 + +.L50e5: +## if (x < 5.0e5) + cmp .L__real_411E848000000000(%rip), %r10 + jae .Lsinf_reduce_precise + +# reduce the argument to be in a range from -pi/4 to +pi/4 +# by subtracting multiples of pi/2 + movsd %xmm0, %xmm2 + movsd .L__real_3fe45f306dc9c883(%rip), %xmm3 # twobypi + movsd %xmm0, %xmm4 + movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5 + mulsd %xmm3, %xmm2 + +#/* How many pi/2 is x a multiple of? */ +# xexp = ax >> EXPSHIFTBITS_DP64; + mov %r10, %r9 + shr $52, %r9 #>>EXPSHIFTBITS_DP64 + +# npi2 = (int)(x * twobypi + 0.5); + addsd %xmm5, %xmm2 # npi2 + + movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1 + cvttpd2dq %xmm2, %xmm0 # convert to integer + movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail + cvtdq2pd %xmm0, %xmm2 # and back to double. 
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ +# rhead = x - npi2 * piby2_1; + mulsd %xmm2, %xmm3 + subsd %xmm3, %xmm4 # rhead + +# rtail = npi2 * piby2_1tail; + mulsd %xmm2, %xmm1 + movd %xmm0, %eax + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead + movsd %xmm4, %xmm0 + subsd %xmm1, %xmm0 + + movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2 + movsd %xmm0,p_temp(%rsp) + movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail + mov p_temp(%rsp), %rcx # rcx is rhead-rtail + +# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc +# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64); + shl $1, %rcx # strip any sign bit + shr $53, %rcx #>> EXPSHIFTBITS_DP64 +1 + sub %rcx, %r9 #expdiff + +## if (expdiff > 15) + cmp $15, %r9 + jle .Lexpdiff15 + +# /* The remainder is pretty small compared with x, which +# implies that x is a near multiple of pi/2 +# (x matches the multiple to at least 15 bits) */ + +# t = rhead; + movsd %xmm4, %xmm1 + +# rtail = npi2 * piby2_2; + mulsd %xmm2, %xmm3 + +# rhead = t - rtail; + mulsd %xmm2, %xmm5 # npi2 * piby2_2tail + subsd %xmm3, %xmm4 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + subsd %xmm4, %xmm1 # t - rhead + subsd %xmm3, %xmm1 # -rtail + subsd %xmm1, %xmm5 #rtail + +# r = rhead - rtail; + movsd %xmm4, %xmm0 + +#HARSHA +#xmm1=rtail + movsd %xmm5, %xmm1 + subsd %xmm5, %xmm0 + +# xmm0=r, xmm4=rhead, xmm1=rtail +.Lexpdiff15: +# region = npi2 & 3; +# No need rr for float case + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +## if the input was close to a pi/2 multiple +# The original NAG code missed this trick. If the input is very close to n*pi/2 after +# reduction, +# then the sinf is ~ 1.0 , to within 15 bits, when r is < 2^-13. We already +# have x at this point, so we can skip the sinf polynomials. + + cmp $0x03f2, %rcx ## if r small. + jge .Lsinf_piby4 # use taylor series if not + cmp $0x03de, %rcx ## if r really small. + jle .Lr_small # then sinf(r) = 0 + + movsd %xmm0, %xmm2 + mulsd %xmm2, %xmm2 #x^2 + +## if region is 0 or 2 do a sinf calc. + and %eax, %r8d + jnz .Lcosfregion + +# region 0 or 2 do a sinf calculation +# use simply polynomial +# x - x*x*x*0.166666666666666666; + movsd .L__real_3fc5555555555555(%rip), %xmm3 # + mulsd %xmm0, %xmm3 # * x + mulsd %xmm2, %xmm3 # * x^2 + subsd %xmm3, %xmm0 # xs + jmp .Ladjust_region + +.align 32 +.Lcosfregion: +# region 1 or 3 do a cosf calculation +# use simply polynomial +# 1.0 - x*x*0.5; + movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0 + mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2 + subsd %xmm2, %xmm0 # xc + jmp .Ladjust_region + +.align 32 +.Lr_small: +## if region is 1 or 3 do a cosf calc. 
+ and %eax, %r8d
+ jz .Ladjust_region
+
+# odd
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # cosf(r) is a 1
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsinf_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+# __amd_remainder_piby2d2f(x, &r, &region);
+
+ mov %r11,p_temp(%rsp)
+ lea region(%rsp), %rdx
+ lea r(%rsp), %rsi
+ movd %xmm0, %rdi
+ sub $0x20, %rsp
+
+ call __amd_remainder_piby2d2f@PLT
+
+ add $0x20, %rsp
+ mov p_temp(%rsp), %r11
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm0 #//x
+ mov region(%rsp), %eax #//region
+
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# perform taylor series to calc sinfx, cosfx
+.Lsinf_piby4:
+# x2 = r * r;
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 #x2
+
+## if region is 0 or 2 do a sinf calc.
+ and %eax, %r8d
+ jnz .Lcosfregion2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 do a sinf calculation
+# zs = x + x3((s1 + x2 * s2) + x4(s3 + x2 * s4));
+ movsd .Lsinfarray+0x30(%rip), %xmm1 # s4
+ mulsd %xmm2, %xmm1 # s4x2
+ movsd %xmm2, %xmm4 # move for x4
+ mulsd %xmm2, %xmm4 # x4
+ movsd .Lsinfarray+0x10(%rip), %xmm5 # s2
+ mulsd %xmm2, %xmm5 # s2x2
+ movsd %xmm0, %xmm3 # move for x3
+ mulsd %xmm2, %xmm3 # x3
+ addsd .Lsinfarray+0x20(%rip), %xmm1 # s3+s4x2
+ mulsd %xmm4, %xmm1 # s3x4+s4x6
+ addsd .Lsinfarray(%rip), %xmm5 # s1+s2x2
+ addsd %xmm5, %xmm1 # s1+s2x2+s3x4+s4x6
+ mulsd %xmm3, %xmm1 # x3(s1+s2x2+s3x4+s4x6)
+ addsd %xmm1, %xmm0 # x + x3(s1+s2x2+s3x4+s4x6)
+
+ jmp .Ladjust_region
+
+.align 32
+.Lcosfregion2:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 1 or 3 - do a cosf calculation
+# zc = 1-0.5*x2+ c1*x4 +c2*x6 +c3*x8 + c4*x10 for a higher precision
+ movsd .Lcosfarray+0x40(%rip), %xmm1 # c4
+ movsd %xmm2, %xmm4 # move for x4
+ mulsd %xmm2, %xmm1 # c4x2
+ movsd .Lcosfarray+0x20(%rip), %xmm3 # c2
+ mulsd %xmm2, %xmm4 # x4
+ movsd .Lcosfarray(%rip), %xmm0 # c0
+ mulsd %xmm2, %xmm3 # c2x2
+ mulsd %xmm2, %xmm0 # c0x2 (=-0.5x2)
+ addsd .Lcosfarray+0x30(%rip), %xmm1 # c3+c4x2
+ mulsd %xmm4, %xmm1 # c3x4 + c4x6
+ addsd .Lcosfarray+0x10(%rip), %xmm3 # c1+c2x2
+ addsd %xmm3, %xmm1 # c1 + c2x2 + c3x4 + c4x6
+ mulsd %xmm4, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10
+ addsd .L__real_3ff0000000000000(%rip), %xmm0 # 1 - 0.5x2
+ addsd %xmm1, %xmm0 # 1 - 0.5x2 + c1x4 + c2x6 + c3x8 + c4x10
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 32
+.Ladjust_region: # positive or negative
+# switch (region)
+ shr $1, %eax
+ mov %eax, %ecx
+ and %r11d, %eax
+
+ not %ecx
+ not %r11d
+ and %r11d, %ecx
+
+ or %ecx, %eax
+ and $1, %eax
+ jnz .Lsinf_cleanup
+
+## if the original region is 0 or 1 and the arg is negative, then we negate the result.
+## if the original region is 2 or 3 and the arg is positive, then we negate the result.
+ movsd %xmm0, %xmm2
+ xorpd %xmm0, %xmm0
+ subsd %xmm2, %xmm0
+
+.align 32
+.Lsinf_cleanup:
+ cvtsd2ss %xmm0, %xmm0
+ add $stack_size, %rsp
+ ret
+
+.align 32
+.Lsinf_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
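
Two pieces of the routine above are worth restating in C: the small-argument ladder that runs before any reduction, and the branch-free sign rule that .Ladjust_region computes with shr/not/and/or. Both functions below are sketches with invented names; the thresholds correspond to the hex cutoffs 0x3f20000000000000 (2^-13) and 0x3f80000000000000 (2^-7) applied to the double-precision |x|.

#include <math.h>

/* Small-|x| shortcuts used in sinf.S before any pi/2 reduction. */
static double sinf_small_path(double dx, int *handled)
{
    double ax = fabs(dx);
    *handled = 1;
    if (ax < 0x1p-13) return dx;                      /* sin(x) ~= x */
    if (ax < 0x1p-7)  return dx - dx * dx * dx / 6.0; /* x - x^3/6 */
    *handled = 0;                                     /* fall through to the polynomials */
    return 0.0;
}

/* Sign selection: the polynomial result is negated iff exactly one of
   "region is 2 or 3" and "x was negative" holds, i.e. an XOR, which is
   what the shr/not/and/or sequence in .Ladjust_region evaluates. */
static int sinf_flip_sign(unsigned region, unsigned xneg) /* xneg: 1 if x < 0 */
{
    return (int)(((region >> 1) ^ xneg) & 1u);
}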
diff --git a/src/gas/trunc.S b/src/gas/trunc.S new file mode 100644 index 0000000..c29d0fd --- /dev/null +++ b/src/gas/trunc.S
@@ -0,0 +1,87 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + +# trunc.S +# +# An implementation of the trunc libm function. +# +# The trunc functions round their argument to the integer value, in floating format, +# nearest to but no larger in magnitude than the argument. +# +# +# Prototype: +# +# double trunc(double x); +# + +# +# Algorithm: +# + +#include "fn_macros.h" +#define fname FN_PROTOTYPE(trunc) + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.text +.align 16 +.p2align 4,,15 +.globl fname +.type fname,@function +fname: + + MOVAPD %xmm0,%xmm1 + +#convert double to integer. + CVTTSD2SIQ %xmm0,%rax + CMP .L__Erro_mask(%rip),%rax + jz .Error_val +#convert integer to double + CVTSI2SDQ %rax,%xmm0 + + PSRLQ $63,%xmm1 + PSLLQ $63,%xmm1 + + POR %xmm1,%xmm0 + + + ret + +.Error_val: + MOVAPD %xmm1,%xmm2 + CMPEQSD %xmm1,%xmm1 + ADDSD %xmm2,%xmm2 + + PAND %xmm1,%xmm0 + PANDN %xmm2,%xmm1 + POR %xmm1,%xmm0 + + + ret + +.data +.align 16 +.L__Erro_mask: .quad 0x8000000000000000 + .quad 0x0
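
In C terms, the routine above does the following. This is a sketch with an invented name: the assembly leans on CVTTSD2SIQ producing the "integer indefinite" pattern 0x8000000000000000 for NaN and out-of-range inputs, which portable C cannot reproduce (out-of-range conversion is undefined), so the sketch range-checks first.

#include <stdint.h>
#include <string.h>

/* Sketch of trunc.S: truncate via an int64 round trip, then OR the
   original sign bit back in (the PSRLQ/PSLLQ $63 pair) so that, e.g.,
   trunc(-0.25) is -0.0. NaN is quieted and returned via x + x, and
   |x| >= 2^52 is returned unchanged since such doubles are already
   integral; these cover what the .Error_val sentinel path handles. */
static double trunc_sketch(double x)
{
    uint64_t ux, ut;
    memcpy(&ux, &x, sizeof ux);

    if (x != x)
        return x + x;                          /* NaN in, quiet NaN out */
    if (x >= 4503599627370496.0 || x <= -4503599627370496.0)
        return x;                              /* |x| >= 2^52: already integral */

    double t = (double)(int64_t)x;             /* CVTTSD2SIQ + CVTSI2SDQ round trip */
    memcpy(&ut, &t, sizeof ut);
    ut |= ux & 0x8000000000000000ull;          /* restore the sign bit */
    memcpy(&t, &ut, sizeof t);
    return t;
}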
diff --git a/src/gas/truncf.S b/src/gas/truncf.S new file mode 100644 index 0000000..c73ad8f --- /dev/null +++ b/src/gas/truncf.S
@@ -0,0 +1,93 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# truncf.S
+#
+# An implementation of the truncf libm function.
+#
+#
+# The truncf functions round their argument to the integer value, in floating format,
+# nearest to but no larger in magnitude than the argument.
+#
+#
+# Prototype:
+#
+# float truncf(float x);
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(truncf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+
+ MOVAPD %xmm0,%xmm1
+
+# convert float to integer.
+ CVTTSS2SIQ %xmm0,%rax
+
+ CMP .L__Erro_mask(%rip),%rax
+ jz .Error_val
+
+# convert integer to float
+ CVTSI2SSQ %rax,%xmm0
+
+ PSRLD $31,%xmm1
+ PSLLD $31,%xmm1
+
+ POR %xmm1,%xmm0
+
+
+ ret
+
+.Error_val:
+ MOVAPD %xmm1,%xmm2
+ CMPEQSS %xmm1,%xmm1
+ ADDSS %xmm2,%xmm2
+
+ PAND %xmm1,%xmm0
+ PANDN %xmm2,%xmm1
+ POR %xmm1,%xmm0
+
+
+
+
+ ret
+
+.data
+.align 16
+.L__Erro_mask: .quad 0x8000000000000000
+ .quad 0x0
diff --git a/src/gas/v4hcosl.S b/src/gas/v4hcosl.S new file mode 100644 index 0000000..a3ded17 --- /dev/null +++ b/src/gas/v4hcosl.S
@@ -0,0 +1,62 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# v4hcosl.s +# +# Helper routines for testing the x4 double and x8 single vector +# math functions. +# +# Prototype: +# +# void v4cos(__m128d x1, __m128d x2, double * ya); +# +# Computes 4 cos values simultaneously and returns them +# in the v4a array. +# Assumes that ya is 16 byte aligned. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +# xmm0 - __m128d x1 +# xmm1 - __m128d x2 +# rdi - double *ya + +.extern __vrd4_cos + .text + .align 16 + .p2align 4,,15 +.globl v4cos + .type v4cos,@function +v4cos: + push %rdi + call __vrd4_cos@PLT + pop %rdi + movdqa %xmm0,(%rdi) + movdqa %xmm1,16(%rdi) + ret
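
A hypothetical caller of this helper might look as follows (a sketch assuming GCC-style intrinsics and alignment syntax; the helper stores its two result vectors with movdqa, so ya must be 16-byte aligned, with the x1 results landing in ya[0..1] and the x2 results in ya[2..3]). The v4exp, v4frcpa, v4log10, v4log2, v4log and v4sin helpers in the following files are called the same way.

#include <emmintrin.h>   /* SSE2 intrinsics */

void v4cos(__m128d x1, __m128d x2, double *ya);  /* the helper defined above */

void demo(double out[4])                         /* illustrative caller, not library code */
{
    __attribute__((aligned(16))) double ya[4];   /* movdqa requires 16-byte alignment */
    __m128d x1 = _mm_set_pd(0.50, 0.25);         /* inputs for ya[1], ya[0] */
    __m128d x2 = _mm_set_pd(2.00, 1.00);         /* inputs for ya[3], ya[2] */
    v4cos(x1, x2, ya);
    for (int i = 0; i < 4; ++i)                  /* cos(0.25), cos(0.5), cos(1), cos(2) */
        out[i] = ya[i];
}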
diff --git a/src/gas/v4helpl.S b/src/gas/v4helpl.S new file mode 100644 index 0000000..02fa080 --- /dev/null +++ b/src/gas/v4helpl.S
@@ -0,0 +1,83 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4helpl.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4exp(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 exp values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# %xmm0 - __m128d x1
+# %xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_exp
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4exp
+ .type v4exp,@function
+v4exp:
+ push %rdi
+ call __vrd4_exp@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+
+# %xmm0 - __m128 x1
+# %xmm1 - __m128 x2
+# rdi - float *ya
+
+.extern __vrs8_expf
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8expf
+ .type v8expf,@function
+v8expf:
+ push %rdi
+ call __vrs8_expf@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
diff --git a/src/gas/v4hfrcpal.S b/src/gas/v4hfrcpal.S new file mode 100644 index 0000000..d648d9d --- /dev/null +++ b/src/gas/v4hfrcpal.S
@@ -0,0 +1,63 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# v4hfrcpal.s +# +# Helper routines for testing the x4 double and x8 single vector +# math functions. +# +# Prototype: +# +# void v4frcpa(__m128d x1, __m128d x2, double * ya); +# +# Computes 4 frcpa values simultaneously and returns them +# in the v4a array. +# Assumes that ya is 16 byte aligned. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +# xmm0 - __m128d x1 +# xmm1 - __m128d x2 +# rdi - double *ya + +.extern __vrd4_frcpa + .text + .align 16 + .p2align 4,,15 +.globl v4frcpa + .type v4frcpa,@function +v4frcpa: + push %rdi + call __vrd4_frcpa@PLT + pop %rdi + movdqa %xmm0,(%rdi) + movdqa %xmm1,16(%rdi) + ret +
diff --git a/src/gas/v4hlog10l.S b/src/gas/v4hlog10l.S new file mode 100644 index 0000000..0cdb6ba --- /dev/null +++ b/src/gas/v4hlog10l.S
@@ -0,0 +1,81 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# v4hlog10l.s +# +# Helper routines for testing the x4 double and x8 single vector +# math functions. +# +# Prototype: +# +# void v4log10(__m128d x1, __m128d x2, double * ya); +# +# Computes 4 log10 values simultaneously and returns them +# in the v4a array. +# Assumes that ya is 16 byte aligned. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +# xmm0 - __m128d x1 +# xmm1 - __m128d x2 +# rdi - double *ya + +.extern __vrd4_log10 + .text + .align 16 + .p2align 4,,15 +.globl v4log10 + .type v4log10,@function +v4log10: + push %rdi + call __vrd4_log10@PLT + pop %rdi + movdqa %xmm0,(%rdi) + movdqa %xmm1,16(%rdi) + ret + +# xmm0 - __m128 x1 +# xmm1 - __m128 x2 +# rdi - single *ya + +.extern __vrs8_log10f + .text + .align 16 + .p2align 4,,15 +.globl v8log10f + .type v8log10f,@function +v8log10f: + push %rdi + call __vrs8_log10f@PLT + pop %rdi + movdqa %xmm0,(%rdi) + movdqa %xmm1,16(%rdi) + + ret
diff --git a/src/gas/v4hlog2l.S b/src/gas/v4hlog2l.S new file mode 100644 index 0000000..1a8c33e --- /dev/null +++ b/src/gas/v4hlog2l.S
@@ -0,0 +1,81 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hlog2l.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4log2(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 log2 values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_log2
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4log2
+ .type v4log2,@function
+v4log2:
+ push %rdi
+ call __vrd4_log2@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+# rdi - single *ya
+
+.extern __vrs8_log2f
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8log2f
+ .type v8log2f,@function
+v8log2f:
+ push %rdi
+ call __vrs8_log2f@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+
+ ret
diff --git a/src/gas/v4hlogl.S b/src/gas/v4hlogl.S new file mode 100644 index 0000000..512648d --- /dev/null +++ b/src/gas/v4hlogl.S
@@ -0,0 +1,84 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hlogl.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4log(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 log values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_log
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4log
+ .type v4log,@function
+v4log:
+ push %rdi
+ call __vrd4_log@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+# rdi - float *ya
+
+#.extern __vrs8_logf
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8logf
+ .type v8logf,@function
+v8logf:
+ push %rdi
+ call __vrs8_logf@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+
+ ret
+
diff --git a/src/gas/v4hsinl.S b/src/gas/v4hsinl.S new file mode 100644 index 0000000..97bfa2d --- /dev/null +++ b/src/gas/v4hsinl.S
@@ -0,0 +1,62 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# v4hsinl.s +# +# Helper routines for testing the x4 double and x8 single vector +# math functions. +# +# Prototype: +# +# void v4sin(__m128d x1, __m128d x2, double * ya); +# +# Computes 4 sin values simultaneously and returns them +# in the v4a array. +# Assumes that ya is 16 byte aligned. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +# xmm0 - __m128d x1 +# xmm1 - __m128d x2 +# rdi - double *ya + +.extern __vrd4_sin + .text + .align 16 + .p2align 4,,15 +.globl v4sin + .type v4sin,@function +v4sin: + push %rdi + call __vrd4_sin@PLT + pop %rdi + movdqa %xmm0,(%rdi) + movdqa %xmm1,16(%rdi) + ret
diff --git a/src/gas/vrd2cos.S b/src/gas/vrd2cos.S new file mode 100644 index 0000000..d12a156 --- /dev/null +++ b/src/gas/vrd2cos.S
@@ -0,0 +1,756 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# A vector implementation of the libm cos function. +# +# Prototype: +# +# __m128d __vrd2_cos(__m128d x); +# +# Computes Cosine of x +# It will provide proper C99 return values, +# but may not raise floating point status bits properly. +# Based on the NAG C implementation. +# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 16 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03fa5555555555555 + .quad 0x0bf56c16c16c16967 # -0.00138889 c2 + .quad 0x0bf56c16c16c16967 + .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3 + .quad 0x03efa01a019f4ec90 + .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4 + .quad 0x0be927e4fa17f65f6 + .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5 + .quad 0x03e21eeb69037ab78 + .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6 + .quad 0x0bda907db46cc5e42 +.Lsinarray: + .quad 
0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bfc5555555555555 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03f81111111110bb3 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0bf2a01a019e83e5c + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03ec71de3796cde01 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0be5ae600b42fdfa7 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x03de5e0b2f9a43bb8 +.Lsincosarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x0bf56c16c16c16967 # c2 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x03efa01a019f4ec90 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x0be927e4fa17f65f6 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x03e21eeb69037ab78 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x0bda907db46cc5e42 +.Lcossinarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bf56c16c16c16967 # c2 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03efa01a019f4ec90 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0be927e4fa17f65f6 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03e21eeb69037ab78 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0bda907db46cc5e42 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + +.text +.align 16 +.p2align 4,,15 + +.equ p_temp, 0x00 # temporary for get/put bits operation +.equ p_temp1,0x10 # temporary for get/put bits operation +.equ p_temp2,0x20 # temporary for get/put bits operation +.equ p_xmm6, 0x30 # temporary for get/put bits operation +.equ p_xmm7, 0x40 # temporary for get/put bits operation +.equ p_xmm8, 0x50 # temporary for get/put bits operation +.equ p_xmm9, 0x60 # temporary for get/put bits operation +.equ p_xmm10,0x70 # temporary for get/put bits operation +.equ p_xmm11,0x80 # temporary for get/put bits operation +.equ p_xmm12,0x90 # temporary for get/put bits operation +.equ p_xmm13,0x0A0 # temporary for get/put bits operation +.equ p_xmm14,0x0B0 # temporary for get/put bits operation +.equ p_xmm15,0x0C0 # temporary for get/put bits operation +.equ r, 0x0D0 # pointer to r for remainder_piby2 +.equ rr, 0x0E0 # pointer to r for remainder_piby2 +.equ region, 0x0F0 # pointer to r for remainder_piby2 +.equ p_original,0x100 # original x +.equ p_mask, 0x110 # original x +.equ p_sign, 0x120 # original x + +.globl __vrd2_cos + .type __vrd2_cos,@function +__vrd2_cos: + sub $0x138,%rsp + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN +movdqa %xmm0, p_original(%rsp) +andpd .L__real_7fffffffffffffff(%rip),%xmm0 +movdqa %xmm0, p_temp(%rsp) +mov $0x3FE921FB54442D18,%rdx #piby4 +mov $0x411E848000000000,%r10 #5e5 +movapd .L__real_v2p__27(%rip),%xmm4 #for later use + +movapd %xmm0,%xmm2 #x +movapd %xmm0,%xmm4 #x + +mov p_temp(%rsp),%rax #rax = lower arg +mov p_temp+8(%rsp),%rcx #rcx = upper arg +movapd .L__real_3fe0000000000000(%rip),%xmm5 #0.5 for later use + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax #is lower arg >= 5e5 + jae .Llower_or_both_arg_gt_5e5 + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lupper_arg_gt_5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lboth_arg_lt_than_5e5: +# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + movapd 
.L__real_3ff921fb54400000(%rip),%xmm3 # xmm3=piby2_1 + addpd %xmm5,%xmm2 # xmm2 = npi2 = x*twobypi+0.5 + movapd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1=piby2_2 + movapd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6=piby2_2tail + cvttpd2dq %xmm2,%xmm0 # xmm0=convert npi2 to ints + cvtdq2pd %xmm0,%xmm2 # xmm2=and back to double. + + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm3 # npi2 * piby2_1 + subpd %xmm3,%xmm4 # xmm4 = rhead=x-npi2*piby2_1 + +#t = rhead; + movapd %xmm4,%xmm5 # xmm5=t=rhead + +#rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm1 # xmm1= npi2*piby2_2 + +#rhead = t - rtail; + subpd %xmm1,%xmm4 # xmm4= rhead = t-rtail + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd %xmm2,%xmm6 # npi2 * piby2_2tail + subpd %xmm4,%xmm5 # t-rhead + subpd %xmm5,%xmm1 # rtail-(t - rhead) + addpd %xmm6,%xmm1 # rtail=npi2*piby2_2+(rtail-(t-rhead)) + +#r = rhead - rtail +#rr=(rhead-r) -rtail +#Sign +#Region + movdqa %xmm0,%xmm5 # Sign + movdqa %xmm0,%xmm6 # Region + movdqa %xmm4,%xmm0 # rhead (handle xmm0 retype) + + paddd .L__reald_one_one(%rip),%xmm6 # Sign + pand .L__reald_two_two(%rip),%xmm6 + punpckldq %xmm6,%xmm6 + psllq $62,%xmm6 # xmm6 is in Int format + + subpd %xmm1,%xmm0 # rhead - rtail + pand .L__reald_one_one(%rip),%xmm5 # Odd/Even region for Sin/Cos + mov .L__reald_one_zero(%rip),%r9 # Compare value for sincos + subpd %xmm0,%xmm4 # rr=rhead-r + movd %xmm5,%r8 # Region + movapd %xmm0,%xmm2 # Move for x2 + movdqa %xmm6,%xmm6 # handle xmm6 retype + mulpd %xmm0,%xmm2 # x2 + subpd %xmm1,%xmm4 # rr=(rhead-r) -rtail + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign + +.align 16 +.L__vrd2_cos_approximate: + cmp $0,%r8 + jnz .Lvrd2_not_cos_piby4 + +.Lvrd2_cos_piby4: + mulpd %xmm0,%xmm4 # x*xx + movdqa .L__real_3fe0000000000000(%rip),%xmm5 # 0.5 (handle xmm5 retype) + movapd .Lcosarray+0x50(%rip),%xmm1 # c6 + movapd .Lcosarray+0x20(%rip),%xmm0 # c3 + mulpd %xmm2,%xmm5 # r = 0.5 *x2 + movapd %xmm2,%xmm3 # copy of x2 for x4 + movapd %xmm4,p_temp(%rsp) # store x*xx + mulpd %xmm2,%xmm1 # c6*x2 + mulpd %xmm2,%xmm0 # c3*x2 + subpd .L__real_3ff0000000000000(%rip),%xmm5 # -t=r-1.0 + mulpd %xmm2,%xmm3 # x4 + addpd .Lcosarray+0x40(%rip),%xmm1 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm0 # c2+x2C3 + addpd .L__real_3ff0000000000000(%rip),%xmm5 # 1 + (-t) + mulpd %xmm2,%xmm3 # x6 + mulpd %xmm2,%xmm1 # x2(c5+x2c6) + mulpd %xmm2,%xmm0 # x2(c2+x2C3) + movapd %xmm2,%xmm4 # copy of x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm4 # r = 0.5 *x2 + addpd .Lcosarray+0x30(%rip),%xmm1 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm0 # c1+x2(c2+x2C3) + mulpd %xmm2,%xmm2 # x4 + subpd %xmm4,%xmm5 # (1 + (-t)) - r + mulpd %xmm3,%xmm1 # x6(c4 + x2(c5+x2c6)) + addpd %xmm1,%xmm0 # zc + subpd .L__real_3ff0000000000000(%rip),%xmm4 # -t=r-1.0 + subpd p_temp(%rsp),%xmm5 # ((1 + (-t)) - r) - x*xx + mulpd %xmm2,%xmm0 # x4 * zc + addpd %xmm5,%xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx) + subpd %xmm4,%xmm0 # result - (-t) + xorpd %xmm6,%xmm0 # xor with sign + jmp .L__vrd2_cos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lvrd2_not_cos_piby4: + cmp $1,%r8 + jnz .Lvrd2_not_cos_sin_piby4 + +.Lvrd2_cos_sin_piby4: + + movdqa %xmm6,p_temp1(%rsp) # Store Sign + movapd %xmm4,p_temp(%rsp) # Store rr + + movapd .Lsincosarray+0x50(%rip),%xmm3 # s6 + mulpd %xmm2,%xmm3 # 
x2s6 + movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3 (handle xmm5 retype) + movapd %xmm2,%xmm1 # move x2 for x4 + mulpd %xmm2,%xmm1 # x4 + mulpd %xmm2,%xmm5 # x2s3 + addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6 + movapd %xmm2,%xmm4 # move x2 for x6 + mulpd %xmm2,%xmm3 # x2(s5+x2s6) + mulpd %xmm1,%xmm4 # x6 + addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3 + mulpd %xmm2,%xmm5 # x2(s2+x2s3) + addpd .Lsincosarray+0x30(%rip),%xmm3 # s4+x2(s5+x2s6) + + movhlps %xmm1,%xmm1 # move high x4 for cos + mulpd %xmm4,%xmm3 # x6(s4+x2(s5+x2s6)) + addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3) + movapd %xmm2,%xmm4 # move low x2 for x3 + mulsd %xmm0,%xmm4 # get low x3 for sin term + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 + + addpd %xmm3,%xmm5 # z + movhlps %xmm2,%xmm6 # move high r for cos + movhlps %xmm5,%xmm3 # xmm5 = sin + # xmm3 = cos + + + mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx + + mulsd %xmm4,%xmm5 # sin *x3 + movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0 + mulsd %xmm1,%xmm3 # cos *x4 + subsd %xmm6,%xmm4 # t=1.0-r + + movhlps %xmm0,%xmm1 + subsd %xmm2,%xmm5 # sin - 0.5 * x2 *xx + + mulsd p_temp+8(%rsp),%xmm1 # x * xx + movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1 + subsd %xmm4,%xmm2 # 1 - t + addsd p_temp(%rsp),%xmm5 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm1,%xmm2 # ((1 + (-t)) - r) - x*xx + addsd %xmm5,%xmm0 # sin + x + addsd %xmm2,%xmm3 # cos+((1-t)-r - x*xx) + addsd %xmm4,%xmm3 # cos+t + + movapd p_temp1(%rsp),%xmm5 # load sign + movlhps %xmm3,%xmm0 + xorpd %xmm5,%xmm0 + jmp .L__vrd2_cos_cleanup + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lvrd2_not_cos_sin_piby4: + cmp %r9,%r8 + jnz .Lvrd2_sin_piby4 + +.Lvrd2_sin_cos_piby4: + + movapd %xmm4,p_temp(%rsp) # rr move to to memory + movapd %xmm0,p_temp1(%rsp) # r move to to memory + movapd %xmm6,p_sign(%rsp) + + movapd .Lcossinarray+0x50(%rip),%xmm3 # s6 + mulpd %xmm2,%xmm3 # x2s6 + movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3 + movapd %xmm2,%xmm1 # move x2 for x4 + mulpd %xmm2,%xmm1 # x4 + mulpd %xmm2,%xmm5 # x2s3 + + addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6 + movapd %xmm2,%xmm4 # move for x6 + mulpd %xmm2,%xmm3 # x2(s5+x2s6) + mulpd %xmm1,%xmm4 # x6 + addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3 + mulpd %xmm2,%xmm5 # x2(s2+x2s3) + addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6) + + movhlps %xmm0,%xmm0 # high of x for x3 + mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6)) + addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3) + + movhlps %xmm2,%xmm4 # high of x2 for x3 + + addpd %xmm5,%xmm3 # z + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 + mulsd %xmm0,%xmm4 # x3 # + movhlps %xmm3,%xmm5 # xmm5 = sin + # xmm3 = cos + + mulsd %xmm4,%xmm5 # sin*x3 # + movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0 # + mulsd %xmm1,%xmm3 # cos*x4 # + + subsd %xmm2,%xmm4 # t=1.0-r # + + movhlps %xmm2,%xmm6 # move 0.5 * x2 for 0.5 * x2 * xx # + mulsd p_temp+8(%rsp),%xmm6 # 0.5 * x2 * xx # + subsd %xmm6,%xmm5 # sin - 0.5 * x2 *xx # + addsd p_temp+8(%rsp),%xmm5 # sin+xx # + + movlpd p_temp1(%rsp),%xmm6 # x + mulsd p_temp(%rsp),%xmm6 # x *xx # + + movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1 # + subsd %xmm4,%xmm1 # 1 -t # + addsd %xmm5,%xmm0 # sin+x # + subsd %xmm2,%xmm1 # (1-t) - r # + subsd %xmm6,%xmm1 # ((1 + (-t)) - r) - x*xx # + addsd %xmm1,%xmm3 # cos+((1 + (-t)) - r) - x*xx # + addsd %xmm4,%xmm3 # cos+t # + + movapd p_sign(%rsp),%xmm2 # load sign + movlhps %xmm0,%xmm3 + movapd %xmm3,%xmm0 + xorpd %xmm2,%xmm0 + jmp .L__vrd2_cos_cleanup + 
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lvrd2_sin_piby4: + movapd .Lsinarray+0x50(%rip),%xmm3 # s6 + mulpd %xmm2,%xmm3 # x2s6 + movapd .Lsinarray+0x20(%rip),%xmm5 # s3 + movapd %xmm4,p_temp(%rsp) # store xx + movapd %xmm2,%xmm1 # move for x4 + mulpd %xmm2,%xmm1 # x4 + movapd %xmm0,p_temp1(%rsp) # store x + + mulpd %xmm2,%xmm5 # x2s3 + movapd %xmm0,%xmm4 # move for x3 + addpd .Lsinarray+0x40(%rip),%xmm3 # s5+x2s6 + mulpd %xmm2,%xmm1 # x6 + mulpd %xmm2,%xmm3 # x2(s5+x2s6) + mulpd %xmm2,%xmm4 # x3 + addpd .Lsinarray+0x10(%rip),%xmm5 # s2+x2s3 + mulpd %xmm2,%xmm5 # x2(s2+x2s3) + addpd .Lsinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2 + + movapd p_temp(%rsp),%xmm0 # load xx + mulpd %xmm1,%xmm3 # x6(s4 + x2(s5+x2s6)) + addpd .Lsinarray(%rip),%xmm5 # s1+x2(s2+x2s3) + mulpd %xmm0,%xmm2 # 0.5 * x2 *xx + addpd %xmm5,%xmm3 # zs + mulpd %xmm3,%xmm4 # *x3 + subpd %xmm2,%xmm4 # x3*zs - 0.5 * x2 *xx + addpd %xmm4,%xmm0 # +xx + addpd p_temp1(%rsp),%xmm0 # +x + + xorpd %xmm6,%xmm0 # xor sign + jmp .L__vrd2_cos_cleanup + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Llower_or_both_arg_gt_5e5: + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 + + movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm4,%xmm4 + +# Work on Upper arg +# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5 +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + +#If upper Arg is <=piby4 + cmp %rdx,%rcx # is upper arg > piby4 + ja 0f + + mov $0,%ecx # region = 0 + mov %ecx,region+4(%rsp) # store upper region + movlpd %xmm0,r+8(%rsp) # store upper r + xorpd %xmm4,%xmm4 # rr = 0 + movlpd %xmm4,rr+8(%rsp) # store upper rr + jmp .Lcheck_lower_arg + +.align 16 +0: +#If upper Arg is > piby4 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1 + cvttsd2si %xmm2,%ecx # xmm0 = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm3 # npi2 * piby2_1 + subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm4,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm6 # npi2 * piby2_2tail + subsd %xmm4,%xmm5 # t-rhead + subsd %xmm5,%xmm1 # (rtail-(t-rhead)) + addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm4,%xmm0 + subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm4 # rr=rhead-r + subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail) + movlpd %xmm0,r+8(%rsp) # store upper r + movlpd %xmm4,rr+8(%rsp) # store upper rr + +#Note that volatiles will be trashed by the call +#We do not care since this is the 
last check +#We will construct r, rr, region and sign +.align 16 +.Lcheck_lower_arg: + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd2_cos_lower_naninf + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd2_cos_lower_naninf: + mov p_original(%rsp),%rax # upper arg is nan/inf + + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) # rr = 0 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrd2_cos_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lupper_arg_gt_5e5: +# Upper Arg is >= 5e5, Lower arg is < 5e5 + movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + movlhps %xmm0,%xmm0 #Not needed since we want to work on lower arg, but done just to be safe and avoide exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case + movlhps %xmm2,%xmm2 + movlhps %xmm4,%xmm4 + + +# Work on Lower arg +# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5 +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + +#If lower Arg is <=piby4 + cmp %rdx,%rax # is upper arg > piby4 + ja 0f + + mov $0,%eax # region = 0 + mov %eax,region(%rsp) # store upper region + movlpd %xmm0,r(%rsp) # store upper r + xorpd %xmm4,%xmm4 # rr = 0 + movlpd %xmm4,rr(%rsp) # store upper rr + jmp .Lcheck_upper_arg + +.align 16 +0: +#If upper Arg is > piby4 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # xmm0 = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm3 # npi2 * piby2_1 + subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm4,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm6 # npi2 * piby2_2tail + subsd %xmm4,%xmm5 # t-rhead + subsd %xmm5,%xmm1 # (rtail-(t-rhead)) + addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store lower region + movsd %xmm4,%xmm0 + subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm4 # rr=rhead-r + subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail) + movlpd %xmm0,r(%rsp) # store lower r + movlpd %xmm4,rr(%rsp) # store lower rr + +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign +.align 16 +.Lcheck_upper_arg: + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrd2_cos_upper_naninf + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + 
+.L__vrd2_cos_upper_naninf: + mov p_original+8(%rsp),%rcx # upper arg is nan/inf + + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) # rr = 0 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrd2_cos_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 + +# movhlps %xmm0, %xmm6 #Save upper fp arg for remainder_piby2 call + movhpd %xmm0, p_temp1(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5 + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + mov %rcx,p_temp(%rsp) #Save upper arg + call __amd_remainder_piby2@PLT + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + mov p_original(%rsp),%rax + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) #rr = 0 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5 + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd p_temp1(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd2_cos_upper_naninf_of_both_gt_5e5: + mov p_original+8(%rsp),%rcx #upper arg is nan/inf +# movd %xmm6,%rcx ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) #rr = 0 + mov %r10d,region+4(%rsp) #region = 0 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +0: +.L__vrd2_cos_reconstruct: +#Construct xmm0=x, xmm2 =x2, xmm4=xx, r8=region, xmm6=sign + movapd r(%rsp),%xmm0 #x + movapd %xmm0,%xmm2 #move for x2 + mulpd %xmm2,%xmm2 #x2 + + movapd rr(%rsp),%xmm4 #xx + + mov region(%rsp),%r8 + mov .L__reald_one_zero(%rip),%r9 #compare value for sincos path + mov %r8,%r10 + and .L__reald_one_one(%rip),%r8 #odd/even region for sin/cos + add .L__reald_one_one(%rip),%r10 + and .L__reald_two_two(%rip),%r10 + mov %r10,%r11 + and .L__reald_two_zero(%rip),%r11 #mask out the lower sign bit leaving the upper sign bit + shl $62,%r10 #shift lower sign bit left by 63 bits + shl $30,%r11 #shift upper sign bit left by 31 bits + mov %r10,p_temp(%rsp) #write out lower sign bit + mov %r11,p_temp+8(%rsp) #write out upper sign bit + movapd p_temp(%rsp),%xmm6 #write out both sign bits to xmm6 + + jmp .L__vrd2_cos_approximate + +#ENDMAIN + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd2_cos_cleanup: + add $0x138,%rsp + ret
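
The per-element sign bookkeeping above (paddd with one_one, pand with two_two, then a shift into the sign-bit slot) encodes a simple rule: the cosine is negated exactly in regions 1 and 2. A scalar sketch of one lane, with an invented name, library sin/cos standing in for the packed polynomials, and a single-constant reduction standing in for the rhead/rtail arithmetic:

#include <math.h>

/* Scalar sketch of one __vrd2_cos lane. After reduction by npi2
   multiples of pi/2 (region = npi2 & 3):
     region 0: cos(x) =  cos(r)   region 1: cos(x) = -sin(r)
     region 2: cos(x) = -cos(r)   region 3: cos(x) =  sin(r)
   (region + 1) & 2 is nonzero exactly for regions 1 and 2, which is
   the bit the paddd/pand sequence shifts into the sign position. */
static double vrd2_cos_sketch(double x)
{
    double ax = fabs(x);                               /* cos is even */
    int npi2 = (int)(ax * 6.36619772367581382433e-01 + 0.5);
    double r = ax - npi2 * 1.57079632679489661923e+00; /* crude single-constant reduction */
    int region = npi2 & 3;
    double p = (region & 1) ? sin(r) : cos(r);         /* region parity picks the polynomial */
    return ((region + 1) & 2) ? -p : p;                /* negate in regions 1 and 2 */
}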
diff --git a/src/gas/vrd2exp.S b/src/gas/vrd2exp.S new file mode 100644 index 0000000..b87763f --- /dev/null +++ b/src/gas/vrd2exp.S
@@ -0,0 +1,372 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2exp.s
+#
+# A vector implementation of the exp libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_exp(__m128d x);
+#
+# Computes e raised to the x power.
+# Does not perform error checking. Denormal results are truncated to 0.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+.equ stack_size,0x28
+
+
+
+
+.globl __vrd2_exp
+ .type __vrd2_exp,@function
+__vrd2_exp:
+ sub $stack_size,%rsp
+
+
+
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movapd %xmm0,p_temp(%rsp) # save x for later
+ movapd .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+ maxpd .L__real_C0F0000000000000(%rip),%xmm0 # protect against very large negative, non-infinite numbers
+ mulpd %xmm0,%xmm3
+
+ minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers
+
+# /* Set n = nearest integer to r */
+ cvtpd2dq %xmm3,%xmm4
+ lea .L__two_to_jby32_lead_table(%rip),%rdi
+ lea .L__two_to_jby32_trail_table(%rip),%rsi
+ cvtdq2pd %xmm4,%xmm1
+
+ # r1 = x - n * logbaseof2_by_32_lead;
+ movapd .L__real_log2_by_32_lead(%rip),%xmm2 #
+ mulpd %xmm1,%xmm2 #
+ movq %xmm4,p_temp1(%rsp)
+ subpd %xmm2,%xmm0 # r1 in xmm0,
+
+# r2 = - n * logbaseof2_by_32_trail;
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1
+# j = n & 0x0000001f;
+ mov $0x01f,%r9
+ mov %r9,%r8
+ mov p_temp1(%rsp),%ecx
+ and %ecx,%r9d
+
+ mov p_temp1+4(%rsp),%edx
+ and %edx,%r8d
+ movapd %xmm0,%xmm2
+# f1 = two_to_jby32_lead_table[j];
+# f2 = two_to_jby32_trail_table[j];
+
+# *m = (n - j) / 32;
+ sub %r9d,%ecx
+ sar $5,%ecx #m
+ sub %r8d,%edx
+ sar $5,%edx
+
+
+ addpd %xmm1,%xmm2 #r = r1 + r2
+
+# Step 2. Compute the polynomial.
+# q = r1 + (r2 + +# r*r*( 5.00000000000000008883e-01 + +# r*( 1.66666666665260878863e-01 + +# r*( 4.16666666662260795726e-02 + +# r*( 8.33336798434219616221e-03 + +# r*( 1.38889490863777199667e-03 )))))); +# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720 + movapd %xmm2,%xmm1 + movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720 + movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6 +# deal with infinite results + mov $1024,%rax + movsx %ecx,%rcx + cmp %rax,%rcx + + mulpd %xmm2,%xmm3 # *x + mulpd %xmm2,%xmm0 # *x + mulpd %xmm2,%xmm1 # x*x + movapd %xmm1,%xmm4 + + cmovg %rax,%rcx ## if infinite, then set rcx to multiply + # by infinity + movsx %edx,%rdx + cmp %rax,%rdx + + addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120 + addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5 + mulpd %xmm1,%xmm4 # x^4 + mulpd %xmm2,%xmm3 # *x + + cmovg %rax,%rdx ## if infinite, then set rcx to multiply + # by infinity +# deal with denormal results + xor %rax,%rax + add $1023,%rcx # add bias + + mulpd %xmm1,%xmm0 # *x^2 + addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24 + addpd %xmm2,%xmm0 # + x + mulpd %xmm4,%xmm3 # *x^4 +# check for infinity or nan + movapd p_temp(%rsp),%xmm2 + cmovs %rax,%rcx ## if denormal, then multiply by 0 + shl $52,%rcx # build 2^n + + addpd %xmm3,%xmm0 # q = final sum + +# *z2 = f2 + ((f1 + f2) * q); + movlpd (%rsi,%r9,8),%xmm5 # f2 + movlpd (%rsi,%r8,8),%xmm4 # f2 + addsd (%rdi,%r9,8),%xmm5 # f1 + f2 + + addsd (%rdi,%r8,8),%xmm4 # f1 + f2 + shufpd $0,%xmm4,%xmm5 + + + mulpd %xmm5,%xmm0 + add $1023,%rdx # add bias + cmovs %rax,%rdx ## if denormal, then multiply by 0 + addpd %xmm5,%xmm0 #z = z1 + z2 +# end of splitexp +# /* Scale (z1 + z2) by 2.0**m */ +# r = scaleDouble_1(z, n); + +#;;; the following code moved to improve scheduling +# deal with infinite results +# mov $1024,%rax +# movsxd %ecx,%rcx +# cmp %rax,%rcx +# cmovg %rax,%rcx ; if infinite, then set rcx to multiply + # by infinity +# movsxd %edx,%rdx +# cmp %rax,%rdx +# cmovg %rax,%rdx ; if infinite, then set rcx to multiply + # by infinity + +# deal with denormal results +# xor %rax,%rax +# add $1023,%rcx ; add bias +# shl $52,%rcx ; build 2^n + +# add $1023,%rdx ; add bias + shl $52,%rdx # build 2^n + +# check for infinity or nan +# movapd p_temp(%rsp),%xmm2 + andpd .L__real_infinity(%rip),%xmm2 + cmppd $0,.L__real_infinity(%rip),%xmm2 + mov %rcx,p_temp1(%rsp) # get 2^n to memory + mov %rdx,p_temp1+8(%rsp) # get 2^n to memory + movmskpd %xmm2,%r8d + test $3,%r8d + +# Step 3. Reconstitute. + + mulpd p_temp1(%rsp),%xmm0 # result*= 2^n + +# we'd like to avoid a branch, and can use cmp's and and's to +# eliminate them. But it adds cycles for normal cases which +# are supposed to be exceptions. Using this branch with the +# check above results in faster code for the normal cases. 
+ jnz .L__exp_naninf
+
+#
+#
+.L__final_check:
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+# check the first number
+ test $1,%r8d
+ jz .L__check2
+
+ mov p_temp(%rsp),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan1 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r1 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r1
+
+#NaN
+.L__enan1:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r1:
+ movd %rdx,%xmm2
+ shufpd $2,%xmm0,%xmm2
+ movsd %xmm2,%xmm0
+# check the second number
+.L__check2:
+ test $2,%r8d
+ jz .L__final_check
+ mov p_temp+8(%rsp),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan2 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r2 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r2
+
+#NaN
+.L__enan2:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r2:
+ movd %rdx,%xmm2
+ shufpd $0,%xmm2,%xmm0
+ jmp .L__final_check
+
+ .data
+ .align 16
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_4040000000000000: .quad 0x04040000000000000 # 32
+ .quad 0x04040000000000000
+.L__real_40F0000000000000: .quad 0x040F0000000000000 # 65536, to protect against really large numbers
+ .quad 0x040F0000000000000
+.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers
+ .quad 0x0C0F0000000000000
+.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32
+ .quad 0x03FA0000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_infinity: .quad 0x07ff0000000000000 #
+ .quad 0x07ff0000000000000 # for alignment
+.L__real_ninfinity: .quad 0x0fff0000000000000 #
+ .quad 0x0fff0000000000000 # for alignment
+.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2
+ .quad 0x040471547652b82fe
+.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead
+ .quad 0x03f962e42fe000000
+.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail
+ .quad 0x0Bdcf473de6af278e
+.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03
+ .quad 0x03f56c1728d739765
+.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03
+ .quad 0x03F811115B7AA905E
+.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02
+ .quad 0x03FA5555555545D4E
+.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01
+ .quad 0x03FC5555555548F7C
+
+
+.L__two_to_jby32_lead_table:
+ .quad 0x03ff0000000000000 # 1
+ .quad 0x03ff059b0d0000000 # 1.0219
+ .quad 0x03ff0b55860000000 # 1.04427
+ .quad 0x03ff11301d0000000 # 1.06714
+ .quad 0x03ff172b830000000 # 1.09051
+ .quad 0x03ff1d48730000000 # 1.11439
+ .quad 0x03ff2387a60000000 # 1.13879
+ .quad 0x03ff29e9df0000000 # 1.16372
+ .quad 0x03ff306fe00000000 # 1.18921
+ .quad 0x03ff371a730000000 # 1.21525
+ .quad 0x03ff3dea640000000 # 1.24186
+ .quad 0x03ff44e0860000000 # 1.26905
+ .quad 0x03ff4bfdad0000000 # 1.29684
+ .quad 0x03ff5342b50000000 # 1.32524
+ .quad 0x03ff5ab07d0000000 # 1.35426
+ .quad 0x03ff6247eb0000000 # 1.38391
+ .quad 0x03ff6a09e60000000 # 1.41421
+ .quad 0x03ff71f75e0000000 # 1.44518
+ .quad 0x03ff7a11470000000 # 1.47683
+ .quad 0x03ff8258990000000 # 1.50916
+ .quad 0x03ff8ace540000000 # 1.54221
+ .quad 0x03ff93737b0000000 # 1.57598
+
.quad 0x03ff9c49180000000 # 1.61049 + .quad 0x03ffa5503b0000000 # 1.64576 + .quad 0x03ffae89f90000000 # 1.68179 + .quad 0x03ffb7f76f0000000 # 1.71862 + .quad 0x03ffc199bd0000000 # 1.75625 + .quad 0x03ffcb720d0000000 # 1.79471 + .quad 0x03ffd5818d0000000 # 1.83401 + .quad 0x03ffdfc9730000000 # 1.87417 + .quad 0x03ffea4afa0000000 # 1.91521 + .quad 0x03fff507650000000 # 1.95714 + .quad 0 # for alignment +.L__two_to_jby32_trail_table: + .quad 0x00000000000000000 # 0 + .quad 0x03e48ac2ba1d73e2a # 1.1489e-008 + .quad 0x03e69f3121ec53172 # 4.83347e-008 + .quad 0x03df25b50a4ebbf1b # 2.67125e-010 + .quad 0x03e68faa2f5b9bef9 # 4.65271e-008 + .quad 0x03e368b9aa7805b80 # 5.24924e-009 + .quad 0x03e6ceac470cd83f6 # 5.38622e-008 + .quad 0x03e547f7b84b09745 # 1.90902e-008 + .quad 0x03e64636e2a5bd1ab # 3.79764e-008 + .quad 0x03e5ceaa72a9c5154 # 2.69307e-008 + .quad 0x03e682468446b6824 # 4.49684e-008 + .quad 0x03e18624b40c4dbd0 # 1.41933e-009 + .quad 0x03e54d8a89c750e5e # 1.94147e-008 + .quad 0x03e5a753e077c2a0f # 2.46409e-008 + .quad 0x03e6a90a852b19260 # 4.94813e-008 + .quad 0x03e0d2ac258f87d03 # 8.48872e-010 + .quad 0x03e59fcef32422cbf # 2.42032e-008 + .quad 0x03e61d8bee7ba46e2 # 3.3242e-008 + .quad 0x03e4f580c36bea881 # 1.45957e-008 + .quad 0x03e62999c25159f11 # 3.46453e-008 + .quad 0x03e415506dadd3e2a # 8.0709e-009 + .quad 0x03e29b8bc9e8a0388 # 2.99439e-009 + .quad 0x03e451f8480e3e236 # 9.83622e-009 + .quad 0x03e41f12ae45a1224 # 8.35492e-009 + .quad 0x03e62b5a75abd0e6a # 3.48493e-008 + .quad 0x03e47daf237553d84 # 1.11085e-008 + .quad 0x03e6b0aa538444196 # 5.03689e-008 + .quad 0x03e69df20d22a0798 # 4.81896e-008 + .quad 0x03e69f7490e4bb40b # 4.83654e-008 + .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008 + .quad 0x03e452486cc2c7b9d # 9.84533e-009 + .quad 0x03e66dc8a80ce9f09 # 4.25828e-008 + .quad 0 # for alignment +
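With the tables in place, the whole of __vrd2_exp can be read back as scalar pseudocode. A C sketch of the reduction and reconstruction (illustrative only: the real code splits ln(2)/32 and the 2^(j/32) entries into lead/tail pairs, clamps the input with maxpd/minpd, and builds 2^m by shifting a biased exponent into place, none of which is modeled here):

    #include <math.h>

    /* model of __vrd2_exp for one lane: exp(x) = 2^m * 2^(j/32) * exp(r) */
    double vexp_model(double x)
    {
        const double ln2 = 0x1.62e42fefa39efp-1;
        int n = (int)lrint(x * 32.0 / ln2);   /* nearest integer, as cvtpd2dq */
        int j = n & 0x1f;                     /* low 5 bits index the tables */
        int m = (n - j) / 32;                 /* remaining power of two */
        double r = x - n * (ln2 / 32.0);      /* reduced argument, |r| <= ln2/64 */
        /* the same degree-6 polynomial as the q = ... comment above */
        double q = r + r * r * (0.5 +
                   r * (1.66666666665260878863e-01 +
                   r * (4.16666666662260795726e-02 +
                   r * (8.33336798434219616221e-03 +
                   r *  1.38889490863777199667e-03))));
        return ldexp(exp2(j / 32.0) * (1.0 + q), m); /* exp2(j/32) models f1+f2 */
    }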
diff --git a/src/gas/vrd2log.S b/src/gas/vrd2log.S new file mode 100644 index 0000000..30bb3b1 --- /dev/null +++ b/src/gas/vrd2log.S
@@ -0,0 +1,573 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrd2log.s +# +# An implementation of the log libm function. +# +# Prototype: +# +# __m128d __vrd2_log(__m128d x); +# +# Computes the natural log of x. +# Returns proper C99 values, but may not raise status flags properly. +# Less than 1 ulp of error. Runs 105-115 cycles for valid inputs. +# +# +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +# define local variable storage offsets +.equ p_x,0 # temporary for error checking operation +.equ p_idx,0x010 # index storage + +.equ stack_size,0x028 + + + + .text + .align 16 + .p2align 4,,15 +.globl __vrd2_log + .type __vrd2_log,@function +__vrd2_log: + sub $stack_size,%rsp + + movdqa %xmm0,p_x(%rsp) # save the input values + + +# /* Store the exponent of x in xexp and put +# f into the range [0.5,1) */ + +# +# compute the index into the log tables +# + pxor %xmm1,%xmm1 + movdqa %xmm0,%xmm3 + psrlq $52,%xmm3 + psubq .L__mask_1023(%rip),%xmm3 + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm6 # xexp + movdqa %xmm0,%xmm2 + xor %rax,%rax + subpd .L__real_one(%rip),%xmm2 + + movdqa %xmm0,%xmm3 + andpd .L__real_notsign(%rip),%xmm2 + pand .L__real_mant(%rip),%xmm3 + movdqa %xmm3,%xmm4 + movapd .L__real_half(%rip),%xmm5 # .5 + + cmppd $1,.L__real_threshold(%rip),%xmm2 + movmskpd %xmm2,%r10d + cmp $3,%r10d + jz .Lall_nearone + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrlq $45,%xmm3 + movdqa %xmm3,%xmm2 + psrlq $1,%xmm3 + paddq .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm2 + paddq %xmm2,%xmm3 + + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm1 + xor %rcx,%rcx + movq %xmm3,p_idx(%rsp) + +# reduce and get u + por .L__real_half(%rip),%xmm4 + movdqa %xmm4,%xmm2 + + + mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128 + + + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx(%rsp),%eax + + subpd %xmm1,%xmm2 # f2 = f - f1 + mulpd %xmm2,%xmm5 + addpd %xmm5,%xmm1 + + divpd %xmm1,%xmm2 # u + +# do error checking here for scheduling. Saves a bunch of cycles as +# compared to doing this at the start of the routine. 
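+# Three movmskpd masks drive the rest of the routine: %r8d flags inf/NaN
+# lanes, %r9d flags zero-or-negative lanes, and %r10d (from the threshold
+# compare above) flags lanes close enough to 1.0 for the near-one path.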
+## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + xorpd %xmm1,%xmm1 + + cmppd $2,%xmm1,%xmm0 + movmskpd %xmm0,%r9d + +# get z + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + mulpd %xmm3,%xmm2 # u5(B+Cu2) + + addpd %xmm2,%xmm1 # poly +# recombine + mulpd %xmm6,%xmm4 # xexp * log2_lead + addpd %xmm4,%xmm0 #r1 + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q + addpd %xmm4,%xmm1 + + mulpd .L__real_log2_tail(%rip),%xmm6 + + addpd %xmm6,%xmm1 #r2 + +# check for nans/infs + test $3,%r8d + addpd %xmm1,%xmm0 + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + +.L__finish: +# see if we have a near one value + test $3,%r10d + jnz .L__near_one +.L__finishn1: + add $stack_size,%rsp + ret + + .align 16 +.Lall_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 +# subsd %xmm6,%xmm0 ; -correction + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# return r + r2; + addpd %xmm2,%xmm0 + jmp .L__finishn1 + + .align 16 +.L__near_one: + test $1,%r10d + jz .L__lnn12 + +# movapd %xmm0,%xmm6 ; save the inputs + movlpd p_x(%rsp),%xmm0 + call .L__ln1 +# shufpd xmm0,$2,%xmm6 + +.L__lnn12: + test $2,%r10d # second number? 
+ jz .L__lnn1e + movlpd %xmm0,p_x(%rsp) + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x+8(%rsp) + movapd p_x(%rsp),%xmm0 +# shufpd xmm6,$0,%xmm0 +# movapd %xmm6,%xmm0 + +.L__lnn1e: + jmp .L__finishn1 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 +# subsd %xmm6,%xmm0 ; -correction + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + +# return r + r2; + addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + +.L__lninfe: + + cmp $3,%r8d # both numbers? + jz .L__finish # return early if so + jmp .L__vlog1 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + +# movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni +# shufpd $2,%xmm1,%xmm0 + +.L__zn2: + test $2,%r9d # second number? 
+ jz .L__zne + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + +.L__zne: + jmp .L__finish + +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x ## if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + + + + .data + .align 16 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_two: .quad 0x04000000000000000 # 2.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03F9EB85000000000 # .03 + .quad 0x03F9EB85000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 0x00000000000000001 # + .quad 0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 + + .align 16 + +.L__np_ln_lead_table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01 + .quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 
0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01 + .quad 0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01 + .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08 + .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 
0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08 + .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08 + .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment + +
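Read together with the tables, the main path of __vrd2_log reduces to a few lines of scalar pseudocode. A C sketch (illustrative only: the real code keeps ln(2) and the table entries as lead/tail pairs, and `log(f1)` below stands in for the .L__np_ln_lead_table/.L__np_ln_tail_table lookup; specials and near-one inputs take the separate paths shown above):

    #include <math.h>

    /* model of __vrd2_log for one lane */
    double vlog_model(double x)
    {
        const double ln2 = 0x1.62e42fefa39efp-1;
        int xexp;
        double f = frexp(x, &xexp);           /* x = f * 2^xexp, f in [0.5,1) */
        int index = (int)(f * 128.0 + 0.5);   /* nearest 1/128 step, 64..128 */
        double f1 = index / 128.0;
        double u = 2.0 * (f - f1) / (f + f1); /* so f/f1 = (1+u/2)/(1-u/2) */
        double v = u * u;
        /* ln(f/f1) = u + u^3/12 + u^5/80 + u^7/448: the cb coefficients */
        double poly = u + u * v * (8.33333333333333593622e-02 +
                          v * (1.24999999978138668903e-02 +
                          v *  2.23219810758559851206e-03));
        return xexp * ln2 + log(f1) + poly;   /* log(f1) models the tables */
    }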
diff --git a/src/gas/vrd2log10.S b/src/gas/vrd2log10.S new file mode 100644 index 0000000..46cb2ad --- /dev/null +++ b/src/gas/vrd2log10.S
@@ -0,0 +1,628 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2log10.s
+#
+# An implementation of the log10 libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_log10(__m128d x);
+#
+# Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. Runs 120-130 cycles for valid inputs.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+
+.equ stack_size,0x028
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd2_log10
+ .type __vrd2_log10,@function
+__vrd2_log10:
+ sub $stack_size,%rsp
+
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+#
+# compute the index into the log10 tables
+#
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ xor %rax,%rax
+ subpd .L__real_one(%rip),%xmm2
+
+ movdqa %xmm0,%xmm3
+ andpd .L__real_notsign(%rip),%xmm2
+ pand .L__real_mant(%rip),%xmm3
+ movdqa %xmm3,%xmm4
+ movapd .L__real_half(%rip),%xmm5 # .5
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%r10d
+ cmp $3,%r10d
+ jz .Lall_nearone
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ xor %rcx,%rcx
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ divpd %xmm1,%xmm2 # u
+
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + xorpd %xmm1,%xmm1 + + cmppd $2,%xmm1,%xmm0 + movmskpd %xmm0,%r9d + +# get z + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + mulpd %xmm3,%xmm2 # u5(B+Cu2) + + addpd %xmm2,%xmm1 # poly +# recombine + mulpd %xmm6,%xmm4 # xexp * log2_lead + addpd %xmm4,%xmm0 #r1 + movapd %xmm0,%xmm2 + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q + addpd %xmm4,%xmm1 + + mulpd .L__real_log2_tail(%rip),%xmm6 + + addpd %xmm6,%xmm1 #r2 + +# loge to log10 + movapd %xmm1,%xmm3 + mulpd .L__real_log10e_tail(%rip),%xmm1 + mulpd .L__real_log10e_tail(%rip),%xmm0 + addpd %xmm1,%xmm0 + mulpd .L__real_log10e_lead(%rip),%xmm3 + addpd %xmm3,%xmm0 + mulpd .L__real_log10e_lead(%rip),%xmm2 +# check for nans/infs + test $3,%r8d + addpd %xmm2,%xmm0 + + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + +.L__finish: +# see if we have a near one value + test $3,%r10d + jnz .L__near_one +.L__finishn1: + add $stack_size,%rsp + ret + + .align 16 +.Lall_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 +# subsd %xmm6,%xmm0 ; -correction + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# loge to log10 + movapd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subpd %xmm3,%xmm0 + addpd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movapd %xmm3,%xmm0 + movapd %xmm2,%xmm1 + + mulpd .L__real_log10e_tail(%rip),%xmm2 + mulpd .L__real_log10e_tail(%rip),%xmm0 + mulpd .L__real_log10e_lead(%rip),%xmm1 + mulpd .L__real_log10e_lead(%rip),%xmm3 + addpd %xmm2,%xmm0 + addpd %xmm1,%xmm0 + addpd %xmm3,%xmm0 + +# return r + r2; +# addpd %xmm2,%xmm0 + jmp .L__finishn1 + + .align 16 +.L__near_one: + test $1,%r10d + jz .L__lnn12 + +# movapd %xmm0,%xmm6 ; save the inputs + movlpd p_x(%rsp),%xmm0 + call .L__ln1 +# shufpd xmm0,$2,%xmm6 + +.L__lnn12: + test $2,%r10d # second number? 
+ jz .L__lnn1e + movlpd %xmm0,p_x(%rsp) + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x+8(%rsp) + movapd p_x(%rsp),%xmm0 +# shufpd xmm6,$0,%xmm0 +# movapd %xmm6,%xmm0 + +.L__lnn1e: + jmp .L__finishn1 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 +# subsd %xmm6,%xmm0 ; -correction + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + +# loge to log10 + movsd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subsd %xmm3,%xmm0 + addsd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movsd %xmm3,%xmm0 + movsd %xmm2,%xmm1 + + mulsd .L__real_log10e_tail(%rip),%xmm2 + mulsd .L__real_log10e_tail(%rip),%xmm0 + mulsd .L__real_log10e_lead(%rip),%xmm1 + mulsd .L__real_log10e_lead(%rip),%xmm3 + addsd %xmm2,%xmm0 + addsd %xmm1,%xmm0 + addsd %xmm3,%xmm0 + + + +# return r + r2; +# addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + +.L__lninfe: + + cmp $3,%r8d # both numbers? + jz .L__finish # return early if so + jmp .L__vlog1 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + +# movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni +# shufpd $2,%xmm1,%xmm0 + +.L__zn2: + test $2,%r9d # second number? 
+ jz .L__zne + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + +.L__zne: + jmp .L__finish + +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x ## if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + + + + .data + .align 16 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_two: .quad 0x04000000000000000 # 2.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold + .quad 0x03FB082C000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 0x00000000000000001 # + .quad 0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 + +.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01 + .quad 0x03fdbcb7800000000 +.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7 + .quad 0x03ea8a93728719535 + +.L__mask_lower: .quad 0x0ffffffff00000000 + .quad 0x0ffffffff00000000 + + .align 16 + +.L__np_ln_lead_table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 
0x3fc0d77e00000000 # 1.31576299667358398438e-01 + .quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01 + .quad 0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01 + .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 
3.23392843005887000414e-08 + .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08 + .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08 + .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment + +
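The only substantive difference from vrd2log.S is the final rescale: the extra-precision result r1 + r2 is multiplied by log10(e), itself split into a lead value with a short mantissa and a tail, with the partial products summed smallest-first. In scalar form (a sketch; r1 and r2 are the high and low parts of ln(x) produced by the reduction):

    /* loge -> log10 rescale used by __vrd2_log10; constants from the tables */
    static double to_log10(double r1, double r2)
    {
        const double log10e_lead = 4.34293746948242187500e-01;   /* short mantissa */
        const double log10e_tail = 7.3495500964015109100644e-07; /* remainder */
        return ((r1 * log10e_tail + r2 * log10e_tail)
                + r2 * log10e_lead) + r1 * log10e_lead;          /* smallest first */
    }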
diff --git a/src/gas/vrd2log2.S b/src/gas/vrd2log2.S new file mode 100644 index 0000000..92fe290 --- /dev/null +++ b/src/gas/vrd2log2.S
@@ -0,0 +1,621 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2log2.s
+#
+# An implementation of the log2 libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_log2(__m128d x);
+#
+# Computes the log2 of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. Runs 105-115 cycles for valid inputs.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+
+.equ stack_size,0x028
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd2_log2
+ .type __vrd2_log2,@function
+__vrd2_log2:
+ sub $stack_size,%rsp
+
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+#
+# compute the index into the log tables
+#
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ xor %rax,%rax
+ subpd .L__real_one(%rip),%xmm2
+
+ movdqa %xmm0,%xmm3
+ andpd .L__real_notsign(%rip),%xmm2
+ pand .L__real_mant(%rip),%xmm3
+ movdqa %xmm3,%xmm4
+ movapd .L__real_half(%rip),%xmm5 # .5
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%r10d
+ cmp $3,%r10d
+ jz .Lall_nearone
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ xor %rcx,%rcx
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ divpd %xmm1,%xmm2 # u
+
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + xorpd %xmm1,%xmm1 + + cmppd $2,%xmm1,%xmm0 + movmskpd %xmm0,%r9d + +# get z + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2e_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + movapd %xmm0,%xmm5 # z1 copy + mulpd %xmm3,%xmm2 # u5(B+Cu2) + movapd .L__real_log2e_tail(%rip),%xmm3 + addpd %xmm2,%xmm1 # poly +# recombine + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + addpd %xmm2,%xmm1 + movapd %xmm1,%xmm2 #z2 copy + + mulpd %xmm4,%xmm5 #z1*log2e_lead + mulpd %xmm4,%xmm1 #z2*log2e_lead + mulpd %xmm3,%xmm2 #z2*log2e_tail + mulpd %xmm3,%xmm0 #z1*log2e_tail + addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp + addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail + addpd %xmm1,%xmm0 #r2 + + +# check for nans/infs + test $3,%r8d + addpd %xmm5,%xmm0 #r1+r2 + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + +.L__finish: +# see if we have a near one value + test $3,%r10d + jnz .L__near_one +.L__finishn1: + add $stack_size,%rsp + ret + + .align 16 +.Lall_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 + movapd .L__real_log2e_tail(%rip),%xmm4 +# subsd %xmm6,%xmm0 ; -correction + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + movapd .L__real_log2e_lead(%rip),%xmm5 + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# loge to log2 + movapd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subpd %xmm3,%xmm0 + addpd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movapd %xmm3,%xmm0 + movapd %xmm2,%xmm1 + + mulpd %xmm4,%xmm2 + mulpd %xmm4,%xmm0 + mulpd %xmm5,%xmm1 + mulpd %xmm5,%xmm3 + addpd %xmm2,%xmm0 + addpd %xmm1,%xmm0 + addpd %xmm3,%xmm0 +# return r + r2; +# addpd %xmm2,%xmm0 + jmp .L__finishn1 + + .align 16 +.L__near_one: + test $1,%r10d + jz .L__lnn12 + +# movapd %xmm0,%xmm6 ; save the inputs + movlpd p_x(%rsp),%xmm0 + call .L__ln1 +# shufpd xmm0,$2,%xmm6 + +.L__lnn12: + test $2,%r10d # second number? 
+ jz .L__lnn1e + movlpd %xmm0,p_x(%rsp) + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x+8(%rsp) + movapd p_x(%rsp),%xmm0 +# shufpd xmm6,$0,%xmm0 +# movapd %xmm6,%xmm0 + +.L__lnn1e: + jmp .L__finishn1 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 +# subsd %xmm6,%xmm0 ; -correction + movsd .L__real_log2e_tail(%rip),%xmm4 + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + movsd .L__real_log2e_lead(%rip),%xmm5 + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + +# loge to log2 + movsd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subsd %xmm3,%xmm0 + addsd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movsd %xmm3,%xmm0 + movsd %xmm2,%xmm1 + + mulsd %xmm4,%xmm2 + mulsd %xmm4,%xmm0 + mulsd %xmm5,%xmm1 + mulsd %xmm5,%xmm3 + addsd %xmm2,%xmm0 + addsd %xmm1,%xmm0 + addsd %xmm3,%xmm0 + +# return r + r2; +# addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + +.L__lninfe: + + cmp $3,%r8d # both numbers? + jz .L__finish # return early if so + jmp .L__vlog1 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + +# movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni +# shufpd $2,%xmm1,%xmm0 + +.L__zn2: + test $2,%r9d # second number? 
+ jz .L__zne + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + +.L__zne: + jmp .L__finish + +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x ## if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + + + + .data + .align 16 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_two: .quad 0x04000000000000000 # 2.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03F9EB85000000000 # .03 + .quad 0x03F9EB85000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 0x00000000000000001 # + .quad 0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 +.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00 + .quad 0x03FF7154400000000 +.L__real_log2e_tail : .quad 0x03ECB295C17F0BBBE # log2e_tail 3.23791044778235969970E-06 + .quad 0x03ECB295C17F0BBBE +.L__mask_lower: .quad 0x0ffffffff00000000 + .quad 0x0ffffffff00000000 + .align 16 + +.L__np_ln_lead_table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01 + 
.quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01 + .quad 0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01 + .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08 + .quad 
0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08 + .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08 + .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment + +
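+# A note on the two tables above (hedged; inferred from how they are
+# indexed earlier in this file): .L__np_ln_lead_table and
+# .L__np_ln_tail_table appear to hold ln(1 + j/64) for j = 0..64, split
+# into a truncated "lead" part and a small "tail" residual so the final
+# result can be reconstructed in extra precision. A rough C sketch of the
+# recombination done above, with illustrative variable names only:
+#
+#   double z1 = ln_lead[j];               /* high bits of ln(1 + j/64) */
+#   double z2 = ln_tail[j] + poly_u;      /* residual + ln(1+u) series */
+#   double r1 = z1 * log2e_lead + xexp;   /* dominant term + exponent  */
+#   double r2 = z1 * log2e_tail + z2 * log2e_lead + z2 * log2e_tail;
+#   return r1 + r2;                       /* the log2 result           */
+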
diff --git a/src/gas/vrd2sin.S b/src/gas/vrd2sin.S new file mode 100644 index 0000000..50c0deb --- /dev/null +++ b/src/gas/vrd2sin.S
@@ -0,0 +1,805 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# A vector implementation of the libm sin function. +# +# Prototype: +# +# __m128d __vrd2_sin(__m128d x); +# +# Computes Sine of x +# It will provide proper C99 return values, +# but may not raise floating point status bits properly. +# Based on the NAG C implementation. +# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + +.data +.align 16 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__real_ffffffffffffffff: .quad 0x0ffffffffffffffff #Sign bit one + .quad 0x0ffffffffffffffff + +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03fa5555555555555 + .quad 0x0bf56c16c16c16967 # -0.00138889 c2 + .quad 0x0bf56c16c16c16967 + .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3 + .quad 0x03efa01a019f4ec90 + .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4 + .quad 0x0be927e4fa17f65f6 + .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5 + .quad 0x03e21eeb69037ab78 + .quad 
0x0bda907db46cc5e42 # -1.13826e-011 c6 + .quad 0x0bda907db46cc5e42 +.Lsinarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bfc5555555555555 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03f81111111110bb3 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0bf2a01a019e83e5c + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03ec71de3796cde01 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0be5ae600b42fdfa7 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x03de5e0b2f9a43bb8 +.Lsincosarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x0bf56c16c16c16967 # c2 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x03efa01a019f4ec90 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x0be927e4fa17f65f6 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x03e21eeb69037ab78 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x0bda907db46cc5e42 +.Lcossinarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bf56c16c16c16967 # c2 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03efa01a019f4ec90 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0be927e4fa17f65f6 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03e21eeb69037ab78 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0bda907db46cc5e42 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .text + .align 16 + .p2align 4,,15 + +.equ p_temp, 0x00 # temporary for get/put bits operation +.equ p_temp1,0x10 # temporary for get/put bits operation +.equ p_temp2,0x20 # temporary for get/put bits operation +.equ p_xmm6, 0x30 # temporary for get/put bits operation +.equ p_xmm7, 0x40 # temporary for get/put bits operation +.equ p_xmm8, 0x50 # temporary for get/put bits operation +.equ p_xmm9, 0x60 # temporary for get/put bits operation +.equ p_xmm10,0x70 # temporary for get/put bits operation +.equ p_xmm11,0x80 # temporary for get/put bits operation +.equ p_xmm12,0x90 # temporary for get/put bits operation +.equ p_xmm13,0x0A0 # temporary for get/put bits operation +.equ p_xmm14,0x0B0 # temporary for get/put bits operation +.equ p_xmm15,0x0C0 # temporary for get/put bits operation +.equ r, 0x0D0 # pointer to r for remainder_piby2 +.equ rr, 0x0E0 # pointer to r for remainder_piby2 +.equ region, 0x0F0 # pointer to r for remainder_piby2 +.equ p_original,0x100 # original x +.equ p_mask, 0x110 # original x +.equ p_sign, 0x120 # original x + +.globl __vrd2_sin + .type __vrd2_sin,@function +__vrd2_sin: + + sub $0x138,%rsp + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN +movdqa %xmm0,%xmm6 #move to mem to get into integer regs ** +andpd .L__real_7fffffffffffffff(%rip), %xmm0 #Unsign - + +movd %xmm0,%rax #rax is lower arg + +movhpd %xmm0, p_temp+8(%rsp) # + +mov p_temp+8(%rsp),%rcx #rcx = upper arg + +movdqa %xmm0,%xmm1 + + #This will mask all nan/infs also +pcmpgtd %xmm6,%xmm1 +movdqa %xmm1,%xmm6 +psrldq $4, %xmm1 +psrldq $8, %xmm6 + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + + +movapd .L__real_3fe0000000000000(%rip), %xmm5 #0.5 for later use + + +por %xmm1,%xmm6 +movd %xmm6,%r11 #Move Sign to gpr ** + +movapd %xmm0,%xmm2 #x + +movapd %xmm0,%xmm4 #x + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 
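+# Overview of the dispatch below (a hedged summary; the labelled blocks
+# that follow are authoritative):
+#  - if both |x| < 5e5, reduce inline with a Cody-Waite style split of
+#    pi/2 (piby2_1 + piby2_2 + piby2_2tail), producing the reduced
+#    argument r, its tail rr, and a per-lane quadrant "region";
+#  - if either |x| >= 5e5, call __amd_remainder_piby2 for that lane
+#    (NaN/Inf lanes are quieted and given region 0);
+#  - the low region bit then selects the sin or cos kernel per lane, and
+#    the result sign is (sign of x) XOR (bit 1 of region). In C terms
+#    (illustrative names, not part of this source):
+#
+#   int region  = (int)npi2 & 3;              /* quadrant per lane   */
+#   int use_cos = region & 1;                 /* odd -> cos kernel   */
+#   int negate  = ((region >> 1) ^ sign) & 1; /* flip result sign    */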
+.Leither_or_both_arg_gt_than_piby4: + + cmp %r10,%rax #is lower arg >= 5e5 + jae .Llower_or_both_arg_gt_5e5 + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lupper_arg_gt_5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lboth_arg_lt_than_5e5: +# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + movapd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3=piby2_1 + addpd %xmm5,%xmm2 # xmm2 = npi2 = x*twobypi+0.5 + movapd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1=piby2_2 + movapd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6=piby2_2tail + cvttpd2dq %xmm2,%xmm0 # xmm0=convert npi2 to ints + cvtdq2pd %xmm0,%xmm2 # xmm2=and back to double. + + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm3 # npi2 * piby2_1 + subpd %xmm3,%xmm4 # xmm4 = rhead=x-npi2*piby2_1 + +#t = rhead; + movapd %xmm4,%xmm5 # xmm5=t=rhead + +#rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm1 # xmm1= npi2*piby2_2 + +#rhead = t - rtail; + subpd %xmm1,%xmm4 # xmm4= rhead = t-rtail + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd %xmm2,%xmm6 # npi2 * piby2_2tail + subpd %xmm4,%xmm5 # t-rhead + subpd %xmm5,%xmm1 # rtail-(t - rhead) + addpd %xmm6,%xmm1 # rtail=npi2*piby2_2+(rtail-(t-rhead)) + +#r = rhead - rtail +#rr=(rhead-r) -rtail +#Sign +#Region + movdqa %xmm0,%xmm5 # Region + + movd %xmm0,%r10 # Sign + movdqa %xmm4,%xmm0 # rhead (handle xmm0 retype) + + + subpd %xmm1,%xmm0 # rhead - rtail + + pand .L__reald_one_one(%rip),%xmm5 # Odd/Even region for Cos/Sin + + mov .L__reald_one_zero(%rip),%r9 # Compare value for cossin + + subpd %xmm0,%xmm4 # rr=rhead-r + + movd %xmm5,%r8 # Region + + movapd %xmm0,%xmm2 # Move for x2 + + mulpd %xmm0,%xmm2 # x2 + + subpd %xmm1,%xmm4 # rr=(rhead-r) -rtail + + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + mov %r10,%rcx + not %r11 #ADDED TO CHANGE THE LOGIC + and %r11,%r10 + not %rcx + not %r11 + and %r11,%rcx + or %rcx,%r10 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + + mov %r10,%r11 + and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $31,%r11 #shift upper sign bit left by 31 bits + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r11,p_sign+8(%rsp) #write out upper sign bit + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign + +.align 16 +.L__vrd2_sin_approximate: + cmp $0,%r8 + jnz .Lvrd2_not_sin_piby4 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lvrd2_sin_piby4: + movapd .Lsinarray+0x50(%rip),%xmm3 # s6 + movapd .Lsinarray+0x20(%rip),%xmm5 # s3 + movapd %xmm2,%xmm1 # move for x4 + + mulpd %xmm2,%xmm3 # x2s6 + mulpd %xmm2,%xmm5 # x2s3 + mulpd %xmm2,%xmm1 # x4 + + addpd .Lsinarray+0x40(%rip),%xmm3 # s5+x2s6 + movapd %xmm2,%xmm6 # move for x3 + addpd .Lsinarray+0x10(%rip),%xmm5 # s2+x2s3 + + mulpd %xmm2,%xmm3 # x2(s5+x2s6) + mulpd %xmm2,%xmm5 # x2(s2+x2s3) + mulpd %xmm2,%xmm1 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6) + addpd .Lsinarray(%rip),%xmm5 # s1+x2(s2+x2s3) + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2 + + mulpd %xmm1,%xmm3 # x6(s4 + x2(s5+x2s6)) + mulpd %xmm0,%xmm6 # x3 + addpd %xmm5,%xmm3 # zs + mulpd %xmm4,%xmm2 # 0.5 * x2 *xx + + mulpd %xmm3,%xmm6 # x3*zs + subpd %xmm2,%xmm6 # x3*zs - 0.5 * x2 *xx + addpd 
%xmm4,%xmm6 # +xx
+ addpd %xmm6,%xmm0 # +x
+ xorpd p_sign(%rsp),%xmm0 # xor sign
+ jmp .L__vrd2_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+.align 16
+.Lvrd2_not_sin_piby4:
+ cmp $1,%r8
+ jnz .Lvrd2_not_sin_cos_piby4
+
+.Lvrd2_sin_cos_piby4:
+
+ movapd %xmm4,p_temp(%rsp) # rr moved to memory
+ movapd %xmm0,p_temp1(%rsp) # r moved to memory
+
+ movapd .Lcossinarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+
+ addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+
+ movhlps %xmm0,%xmm0 # high of x for x3
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+ addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+
+ movhlps %xmm2,%xmm4 # high of x2 for x3
+ addpd %xmm5,%xmm3 # z
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+ mulsd %xmm0,%xmm4 # x3 #
+ movhlps %xmm3,%xmm5 # xmm5 = sin
+ # xmm3 = cos
+
+ mulsd %xmm4,%xmm5 # sin*x3 #
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0 #
+ mulsd %xmm1,%xmm3 # cos*x4 #
+
+ subsd %xmm2,%xmm4 # t=1.0-r #
+
+ movhlps %xmm2,%xmm6 # move 0.5 * x2 for 0.5 * x2 * xx #
+ mulsd p_temp+8(%rsp),%xmm6 # 0.5 * x2 * xx #
+ subsd %xmm6,%xmm5 # sin - 0.5 * x2 *xx #
+ addsd p_temp+8(%rsp),%xmm5 # sin+xx #
+
+ movlpd p_temp1(%rsp),%xmm6 # x
+ mulsd p_temp(%rsp),%xmm6 # x *xx #
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1 #
+ subsd %xmm4,%xmm1 # 1 -t #
+ addsd %xmm5,%xmm0 # sin+x #
+ subsd %xmm2,%xmm1 # (1-t) - r #
+ subsd %xmm6,%xmm1 # ((1 + (-t)) - r) - x*xx #
+ addsd %xmm1,%xmm3 # cos+((1 + (-t)) - r) - x*xx #
+ addsd %xmm4,%xmm3 # cos+t #
+
+ movapd p_sign(%rsp),%xmm2 # load sign
+ movlhps %xmm0,%xmm3
+ movapd %xmm3,%xmm0
+ xorpd %xmm2,%xmm0
+ jmp .L__vrd2_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+.align 16
+.Lvrd2_not_sin_cos_piby4:
+ cmp %r9,%r8
+ jnz .Lvrd2_cos_piby4
+
+.Lvrd2_cos_sin_piby4:
+
+ movapd %xmm4,p_temp(%rsp) # Store rr
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3 (handle xmm5 retype)
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move x2 for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4+x2(s5+x2s6)
+
+ movhlps %xmm1,%xmm1 # move high x4 for cos
+ mulpd %xmm4,%xmm3 # x6(s4+x2(s5+x2s6))
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ movapd %xmm2,%xmm4 # move low x2 for x3
+ mulsd %xmm0,%xmm4 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+
+ addpd %xmm3,%xmm5 # z
+ movhlps %xmm2,%xmm6 # move high r for cos
+ movhlps %xmm5,%xmm3 # xmm5 = sin
+ # xmm3 = cos
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx
+
+ mulsd %xmm4,%xmm5 # sin *x3
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0
+ mulsd %xmm1,%xmm3 # cos *x4
+
subsd %xmm6,%xmm4 # t=1.0-r + + movhlps %xmm0,%xmm1 + subsd %xmm2,%xmm5 # sin - 0.5 * x2 *xx + + mulsd p_temp+8(%rsp),%xmm1 # x * xx + movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1 + subsd %xmm4,%xmm2 # 1 - t + addsd p_temp(%rsp),%xmm5 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm1,%xmm2 # ((1 + (-t)) - r) - x*xx + addsd %xmm5,%xmm0 # sin + x + addsd %xmm2,%xmm3 # cos+((1-t)-r - x*xx) + addsd %xmm4,%xmm3 # cos+t + + movapd p_sign(%rsp),%xmm5 # load sign + movlhps %xmm3,%xmm0 + xorpd %xmm5,%xmm0 + jmp .L__vrd2_sin_cleanup + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 + +.Lvrd2_cos_piby4: + mulpd %xmm0,%xmm4 # x*xx + movdqa .L__real_3fe0000000000000(%rip),%xmm5 # 0.5 (handle xmm5 retype) + movapd .Lcosarray+0x50(%rip),%xmm1 # c6 + movapd .Lcosarray+0x20(%rip),%xmm0 # c3 + mulpd %xmm2,%xmm5 # r = 0.5 *x2 + movapd %xmm2,%xmm3 # copy of x2 for x4 + movapd %xmm4,p_temp(%rsp) # store x*xx + mulpd %xmm2,%xmm1 # c6*x2 + mulpd %xmm2,%xmm0 # c3*x2 + subpd .L__real_3ff0000000000000(%rip),%xmm5 # -t=r-1.0 + mulpd %xmm2,%xmm3 # x4 + addpd .Lcosarray+0x40(%rip),%xmm1 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm0 # c2+x2C3 + addpd .L__real_3ff0000000000000(%rip),%xmm5 # 1 + (-t) + mulpd %xmm2,%xmm3 # x6 + mulpd %xmm2,%xmm1 # x2(c5+x2c6) + mulpd %xmm2,%xmm0 # x2(c2+x2C3) + movapd %xmm2,%xmm4 # copy of x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm4 # r = 0.5 *x2 + addpd .Lcosarray+0x30(%rip),%xmm1 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm0 # c1+x2(c2+x2C3) + mulpd %xmm2,%xmm2 # x4 + subpd %xmm4,%xmm5 # (1 + (-t)) - r + mulpd %xmm3,%xmm1 # x6(c4 + x2(c5+x2c6)) + addpd %xmm1,%xmm0 # zc + subpd .L__real_3ff0000000000000(%rip),%xmm4 # -t=r-1.0 + subpd p_temp(%rsp),%xmm5 # ((1 + (-t)) - r) - x*xx + mulpd %xmm2,%xmm0 # x4 * zc + addpd %xmm5,%xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx) + subpd %xmm4,%xmm0 # result - (-t) + xorpd p_sign(%rsp),%xmm0 # xor with sign + jmp .L__vrd2_sin_cleanup + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Llower_or_both_arg_gt_5e5: + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 + + movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call + + movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm4,%xmm4 + +# Work on Upper arg +# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5 +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + +#If upper Arg is <=piby4 + cmp %rdx,%rcx # is upper arg > piby4 + ja 0f + + mov $0,%ecx # region = 0 + mov %ecx,region+4(%rsp) # store upper region + movlpd %xmm0,r+8(%rsp) # store upper r (unsigned - sign is adjusted later based on sign) + xorpd %xmm4,%xmm4 # rr = 0 + movlpd %xmm4,rr+8(%rsp) # store upper rr + jmp .Lcheck_lower_arg + +#If upper Arg is > piby4 +.align 16 +0: + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1 + cvttsd2si %xmm2,%ecx # xmm0 = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + + #/* Subtract the multiple from x to get an extra-precision remainder */ + #rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm3 # npi2 * piby2_1 + subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1) + movsd 
.L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+ #t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+ #rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+ #rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+ #rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+ #r = rhead - rtail
+ #rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+
+#If lower Arg is > 5e5
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_lower_arg:
+ mov $0x07ff0000000000000,%r9 # is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf
+
+ mov %r11,p_temp(%rsp) #Save Sign
+
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_sin_reconstruct
+
+.L__vrd2_cos_lower_naninf:
+ mov r(%rsp),%rax
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+ jmp .L__vrd2_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lupper_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+ movlhps %xmm0,%xmm0 #Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+ movlhps %xmm2,%xmm2
+ movlhps %xmm4,%xmm4
+
+# Work on Lower arg
+# xmm0, xmm2, xmm4 = x, xmm5 = 0.5
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+#If lower Arg is <=piby4
+ cmp %rdx,%rax # is lower arg > piby4
+ ja 0f
+
+ mov $0,%eax # region = 0
+ mov %eax,region(%rsp) # store lower region
+ movlpd %xmm0,r(%rsp) # store lower r
+ xorpd %xmm4,%xmm4 # rr = 0
+ movlpd %xmm4,rr(%rsp) # store lower rr
+ jmp .Lcheck_upper_arg
+
+.align 16
+0:
+#If lower Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+ cvttsd2si %xmm2,%eax # eax = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd
%xmm4,%xmm5 # t-rhead + subsd %xmm5,%xmm1 # (rtail-(t-rhead)) + addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store lower region + movsd %xmm4,%xmm0 + subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm4 # rr=rhead-r + subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail) + movlpd %xmm0,r(%rsp) # store lower r + movlpd %xmm4,rr(%rsp) # store lower rr + +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign +.align 16 +.Lcheck_upper_arg: + mov $0x07ff0000000000000,%r9 # is upper arg nan/inf + mov %r9,%r10 + and %rcx,%r10 + cmp %r9,%r10 + jz .L__vrd2_cos_upper_naninf + + mov %r11,p_temp(%rsp) #Save Sign + + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r11 #Restore Sign + + jmp .L__vrd2_sin_reconstruct + +.L__vrd2_cos_upper_naninf: + mov r+8(%rsp),%rcx # upper arg is nan/inf + mov $0x00008000000000000,%r9 + or %r9,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) # rr = 0 + mov %r10d,region+4(%rsp) # region =0 + jmp .L__vrd2_sin_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 + + movhpd %xmm0,p_temp2(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r9 #is lower arg nan/inf + mov %r9,%r10 + and %rax,%r10 + cmp %r9,%r10 + jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r11,p_temp1(%rsp) #Save Sign + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + call __amd_remainder_piby2@PLT + + mov p_temp1(%rsp),%r11 #Restore Sign + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + movd %xmm0,%rax + mov $0x00008000000000000,%r9 + or %r9,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) #rr = 0 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r9 #is upper arg nan/inf + mov %r9,%r10 + and %rcx,%r10 + cmp %r9,%r10 + jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5 + + + mov %r11,p_temp(%rsp) #Save Sign + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd p_temp2(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r11 #Restore Sign + + jmp 0f + +.L__vrd2_cos_upper_naninf_of_both_gt_5e5: + mov p_temp2(%rsp),%rcx #upper arg is nan/inf + mov $0x00008000000000000,%r9 + or %r9,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) #rr = 0 + mov %r10d,region+4(%rsp) #region = 0 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +0: +.L__vrd2_sin_reconstruct: +#Construct xmm0=x, xmm2 =x2, xmm4=xx, r8=region, xmm6=sign + movapd r(%rsp),%xmm0 #x + movapd %xmm0,%xmm2 #move for x2 + mulpd %xmm2,%xmm2 #x2 + movapd rr(%rsp),%xmm4 #xx + + mov region(%rsp),%r8 + mov .L__reald_one_zero(%rip),%r9 #compare value for cossin path + mov %r8,%r10 + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B 
is upper bit of region + mov %r10,%rcx + not %r11 #ADDED TO CHANGE THE LOGIC + and %r11,%r10 + not %rcx + not %r11 + and %r11,%rcx + or %rcx,%r10 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + + mov %r10,%r11 + and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $31,%r11 #shift upper sign bit left by 31 bits + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r11,p_sign+8(%rsp) #write out upper sign bit + + jmp .L__vrd2_sin_approximate +#ENDMAIN + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd2_sin_cleanup: + add $0x138,%rsp + ret +
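+# Usage sketch (illustrative C, not part of this source). Per the
+# prototype in the header comment above, __m128d __vrd2_sin(__m128d x):
+#
+#   #include <emmintrin.h>
+#   extern __m128d __vrd2_sin(__m128d x);
+#
+#   __m128d v = _mm_set_pd(2.0, 1.0);  /* lanes {1.0, 2.0}           */
+#   __m128d s = __vrd2_sin(v);         /* lanes {sin(1.0), sin(2.0)} */
+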
diff --git a/src/gas/vrd2sincos.S b/src/gas/vrd2sincos.S new file mode 100644 index 0000000..b25bb37 --- /dev/null +++ b/src/gas/vrd2sincos.S
@@ -0,0 +1,968 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# A vector implementation of the libm sincos function. +# +# Prototype: +# +# __vrd2_sincos(__m128d x, __m128d* ys, __m128d* yc); +# +# Computes Sine and Cosine of x. +# It will provide proper C99 return values, +# but may not raise floating point status bits properly. +# Based on the NAG C implementation. +# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + + + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + +.data +.align 16 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__real_ffffffffffffffff: .quad 0x0ffffffffffffffff #Sign bit one + .quad 0x0ffffffffffffffff +.L__real_naninf_upper_sign_mask: .quad 0x000000000ffffffff # + .quad 0x000000000ffffffff # +.L__real_naninf_lower_sign_mask: .quad 0x0ffffffff00000000 # + .quad 0x0ffffffff00000000 # + +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03fa5555555555555 + .quad 0x0bf56c16c16c16967 # -0.00138889 c2 + .quad 0x0bf56c16c16c16967 + .quad 
0x03efa01a019f4ec90 # 2.48016e-005 c3 + .quad 0x03efa01a019f4ec90 + .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4 + .quad 0x0be927e4fa17f65f6 + .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5 + .quad 0x03e21eeb69037ab78 + .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6 + .quad 0x0bda907db46cc5e42 +.Lsinarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bfc5555555555555 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03f81111111110bb3 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0bf2a01a019e83e5c + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03ec71de3796cde01 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0be5ae600b42fdfa7 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x03de5e0b2f9a43bb8 +.Lsincosarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x0bf56c16c16c16967 # c2 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x03efa01a019f4ec90 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x0be927e4fa17f65f6 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x03e21eeb69037ab78 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x0bda907db46cc5e42 +.Lcossinarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bf56c16c16c16967 # c2 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03efa01a019f4ec90 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0be927e4fa17f65f6 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03e21eeb69037ab78 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0bda907db46cc5e42 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + + +.text +.align 16 +.p2align 4,,15 + +.equ p_temp, 0x00 # temporary for get/put bits operation +.equ p_temp1, 0x10 # temporary for get/put bits operation +.equ p_temp2, 0x20 # temporary for get/put bits operation + +.equ save_xmm6, 0x30 # temporary for get/put bits operation +.equ save_xmm7, 0x40 # temporary for get/put bits operation +.equ save_xmm8, 0x50 # temporary for get/put bits operation +.equ save_xmm9, 0x60 # temporary for get/put bits operation +.equ save_xmm10, 0x70 # temporary for get/put bits operation +.equ save_xmm11, 0x80 # temporary for get/put bits operation +.equ save_xmm12, 0x90 # temporary for get/put bits operation +.equ save_xmm13, 0x0A0 # temporary for get/put bits operation +.equ save_xmm14, 0x0B0 # temporary for get/put bits operation +.equ save_xmm15, 0x0C0 # temporary for get/put bits operation + +.equ save_rdi, 0x0D0 +.equ save_rsi, 0x0E0 + +.equ r, 0x0F0 # pointer to r for remainder_piby2 +.equ rr, 0x0100 # pointer to r for remainder_piby2 +.equ region, 0x0110 # pointer to r for remainder_piby2 + +.equ p_original, 0x0120 # original x +.equ p_mask, 0x0130 # original x +.equ p_sign, 0x0140 # original x +.equ p_sign1, 0x0150 # original x +.equ p_x, 0x0160 #x +.equ p_xx, 0x0170 #xx +.equ p_x2, 0x0180 #x2 +.equ p_sin, 0x0190 #sin +.equ p_cos, 0x01A0 #cos +.equ p_temp2, 0x01B0 # temporary for get/put bits operation + +.globl __vrd2_sincos + .type __vrd2_sincos,@function +__vrd2_sincos: + sub $0x1C8,%rsp + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + +movdqa %xmm0,%xmm6 #move to mem to get into integer regs ** +movdqa %xmm0, p_original(%rsp) #move to mem to get into integer regs - + +andpd .L__real_7fffffffffffffff(%rip),%xmm0 #Unsign - + +mov %rdi, p_sin(%rsp) # save address for sin return +mov %rsi, p_cos(%rsp) # save address for cos return 
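+# (Assuming the SysV AMD64 convention implied by the prototype above:
+# x arrives in %xmm0 and the __m128d* result pointers for sin and cos
+# arrive in %rdi and %rsi; they are spilled here because the calls to
+# __amd_remainder_piby2 below clobber the argument registers.)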
+ +movd %xmm0,%rax #rax is lower arg +movhpd %xmm0, p_temp+8(%rsp) # +mov p_temp+8(%rsp),%rcx #rcx = upper arg +movdqa %xmm0,%xmm8 + +pcmpgtd %xmm6,%xmm8 +movdqa %xmm8,%xmm6 +psrldq $4,%xmm8 +psrldq $8,%xmm6 + +mov $0x3FE921FB54442D18,%rdx #piby4 +mov $0x411E848000000000,%r10 #5e5 +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + +por %xmm6,%xmm8 +movd %xmm8,%r11 #Move Sign to gpr ** + +movapd %xmm0,%xmm2 #x +movapd %xmm0,%xmm6 #x + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Leither_or_both_arg_gt_than_piby4: + + cmp %r10,%rax #is lower arg >= 5e5 + jae .Llower_or_both_arg_gt_5e5 + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lupper_arg_gt_5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lboth_arg_lt_than_5e5: +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + + +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + mov %r8,%r10 + mov %r8,%rcx + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + mov %r10,%rax + not %r11 #ADDED TO CHANGE THE LOGIC + and %r11,%r10 + not %rax + not %r11 + and %r11,%rax + or %rax,%r10 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + mov %r10,%r11 + and %rdx,%r11 #mask out the lower sign bit leaving the upper sign bit + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $31,%r11 #shift upper sign bit left by 31 bits + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r11,p_sign+8(%rsp) #write out upper sign bit + +# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail + + movapd %xmm0,%xmm6 # rhead + subpd %xmm8,%xmm0 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail + + subpd %xmm0,%xmm6 #rr=rhead-r + movapd %xmm0,%xmm2 #move r for r2 + mulpd %xmm0,%xmm2 #r2 + subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail + + mov .L__reald_one_zero(%rip),%r9 # Compare value for cossin + + + + add .L__reald_one_one(%rip),%rcx + and .L__reald_two_two(%rip),%rcx + shr $1,%rcx + + mov %rcx,%rdx + and %r9,%rdx #mask out the lower sign bit leaving the upper sign bit + shl $63,%rcx #shift lower sign bit left by 63 bits + shl $31,%rdx #shift upper sign bit left by 31 bits + mov %rcx,p_sign1(%rsp) #write out lower sign bit + mov %rdx,p_sign1+8(%rsp) 
#write out upper sign bit
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.L__vrd2_sincos_approximate:
+ cmp $0,%r8
+ jnz .Lvrd2_not_sin_piby4
+
+.Lvrd2_sin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm4 # x4 * zc
+ mulpd %xmm2,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm5 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm5 # sin + xx
+ subpd p_temp1(%rsp),%xmm4 # cos - (-t)
+ addpd %xmm0,%xmm5 # sin + x
+
+ jmp .L__vrd2_sincos_cleanup
+
+.align 16
+.Lvrd2_not_sin_piby4:
+ cmp .L__reald_one_one(%rip),%r8
+ jnz .Lvrd2_not_cos_piby4
+
+.Lvrd2_cos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+
+
addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + mulpd %xmm2,%xmm11 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + + subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r + mulpd %xmm0,%xmm2 # x3 recalculate + + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6)) + + movapd %xmm0,%xmm1 + movapd %xmm6,%xmm7 + mulpd %xmm6,%xmm1 # x*xx + mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2 + + addpd %xmm9,%xmm5 # zc + addpd %xmm8,%xmm4 # zs + + subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx + + mulpd %xmm3,%xmm5 # x4 * zc + mulpd %xmm2,%xmm4 # x3 * zs + + addpd %xmm10,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + subpd %xmm7,%xmm4 # x3*zs - 0.5 * x2 *xx + + addpd %xmm6,%xmm4 # sin + xx + subpd p_temp1(%rsp),%xmm5 # cos - (-t) + addpd %xmm0,%xmm4 # sin + x + + jmp .L__vrd2_sincos_cleanup + +.align 16 +.Lvrd2_not_cos_piby4: + cmp $1,%r8 + jnz .Lvrd2_cossin_piby4 + +.Lvrd2_sincos_piby4: + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,%xmm10 # x2 + movapd %xmm2,%xmm11 # x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm2,%xmm5 # c6*x2 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm2,%xmm9 # c3*x2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + movapd %xmm2,p_temp(%rsp) # store x2 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + movapd %xmm10,p_temp2(%rsp) # store r + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + mulpd %xmm2,%xmm11 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm2,%xmm5 # x2(c5+x2c6) + movapd %xmm10,p_temp1(%rsp) # store t + movapd %xmm11,%xmm3 # Keep x4 + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm2,%xmm9 # x2(c2+x2C3) + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + mulpd %xmm2,%xmm11 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r + mulpd %xmm0,%xmm2 # x3 recalculate + + mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + movapd %xmm0,%xmm1 + movapd %xmm6,%xmm7 + mulpd %xmm6,%xmm1 # x*xx + mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2 + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zs + + subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx + + mulpd %xmm3,%xmm4 # x4 * zc + mulpd %xmm2,%xmm5 # x3 * zs + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + subpd %xmm7,%xmm5 # x3*zs - 0.5 * x2 *xx + + addpd %xmm6,%xmm5 # sin + xx + subpd p_temp1(%rsp),%xmm4 # cos - (-t) + addpd %xmm0,%xmm5 # sin + x + + movsd %xmm4,%xmm1 + movsd %xmm5,%xmm4 + movsd %xmm1,%xmm5 + + jmp .L__vrd2_sincos_cleanup + +.align 16 +.Lvrd2_cossin_piby4: + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + + movapd %xmm2,%xmm10 # x2 + movapd %xmm2,%xmm11 # x2 + + mulpd %xmm2,%xmm5 # c6*x2 + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm2,%xmm9 # c3*x2 + mulpd %xmm2,%xmm8 # c3*x2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + movapd %xmm2,p_temp(%rsp) # 
store x2 + + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + movapd %xmm10,p_temp2(%rsp) # store r + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + mulpd %xmm2,%xmm11 # x4 + + mulpd %xmm2,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + movapd %xmm10,p_temp1(%rsp) # store t + movapd %xmm11,%xmm3 # Keep x4 + mulpd %xmm2,%xmm9 # x2(c2+x2C3) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + mulpd %xmm2,%xmm11 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + + subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r + mulpd %xmm0,%xmm2 # x3 recalculate + + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6)) + + movapd %xmm0,%xmm1 + movapd %xmm6,%xmm7 + mulpd %xmm6,%xmm1 # x*xx + mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2 + + addpd %xmm9,%xmm5 # zc + addpd %xmm8,%xmm4 # zs + + subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx + + mulpd %xmm3,%xmm5 # x4 * zc + mulpd %xmm2,%xmm4 # x3 * zs + + addpd %xmm10,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + subpd %xmm7,%xmm4 # x3*zs - 0.5 * x2 *xx + + addpd %xmm6,%xmm4 # sin + xx + subpd p_temp1(%rsp),%xmm5 # cos - (-t) + addpd %xmm0,%xmm4 # sin + x + + movsd %xmm5,%xmm1 + movsd %xmm4,%xmm5 + movsd %xmm1,%xmm4 + + jmp .L__vrd2_sincos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Llower_or_both_arg_gt_5e5: + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 + + movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call + + movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + +#If upper Arg is <=piby4 + cmp %rdx,%rcx # is upper arg > piby4 + ja 0f + + mov $0,%ecx # region = 0 + mov %ecx,region+4(%rsp) # store upper region + movlpd %xmm0,r+8(%rsp) # store upper r (unsigned - sign is adjusted later based on sign) + xorpd %xmm4,%xmm4 # rr = 0 + movlpd %xmm4,rr+8(%rsp) # store upper rr + jmp .Lcheck_lower_arg + +#If upper Arg is > piby4 +.align 16 +0: + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1 + cvttsd2si %xmm2,%ecx # npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # piby2_2 + cvtsi2sd %ecx,%xmm2 # npi2 trunc to doubles + + #/* Subtract the multiple from x to get an extra-precision remainder */ + #rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm3 # npi2 * piby2_1 + subsd %xmm3,%xmm6 # rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm8 # piby2_2tail + + #t = rhead; + movsd %xmm6,%xmm5 # t = rhead + + #rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm1 # rtail=(npi2*piby2_2) + + #rhead = t - rtail + subsd %xmm1,%xmm6 # rhead=(t-rtail) + + #rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm8 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd 
%xmm5,%xmm1 # (rtail-(t-rhead)) + addsd %xmm8,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + + #r = rhead - rtail + #rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm0 + subsd %xmm1,%xmm0 # r=(rhead-rtail) + + subsd %xmm0,%xmm6 # rr=rhead-r + subsd %xmm1,%xmm6 # xmm4 = rr=((rhead-r) -rtail) + + movlpd %xmm0,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#If lower Arg is > 5e5 +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign +.align 16 +.Lcheck_lower_arg: + mov $0x07ff0000000000000,%r9 # is lower arg nan/inf + mov %r9,%r10 + and %rax,%r10 + cmp %r9,%r10 + jz .L__vrd2_cos_lower_naninf + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + mov %r11,p_temp(%rsp) #Save Sign + call __amd_remainder_piby2@PLT + mov p_temp(%rsp),%r11 #Restore Sign + + jmp .L__vrd2_cos_reconstruct + +.L__vrd2_cos_lower_naninf: + mov p_original(%rsp),%rax # upper arg is nan/inf + + mov $0x00008000000000000,%r9 + or %r9,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) # rr = 0 + mov %r10d,region(%rsp) # region =0 + and .L__real_naninf_lower_sign_mask(%rip),%r11 # Sign + + jmp .L__vrd2_cos_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lupper_arg_gt_5e5: +# Upper Arg is >= 5e5, Lower arg is < 5e5 + movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call +# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoide exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case +# movlhps %xmm2,%xmm2 +# movlhps %xmm6,%xmm6 + +# Work on Lower arg +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + +#If lower Arg is <=piby4 + cmp %rdx,%rax # is upper arg > piby4 + ja 0f + + mov $0,%eax # region = 0 + mov %eax,region(%rsp) # store upper region + movlpd %xmm0,r(%rsp) # store upper r + xorpd %xmm4,%xmm4 # rr = 0 + movlpd %xmm4,rr(%rsp) # store upper rr + jmp .Lcheck_upper_arg + +.align 16 +0: +#If upper Arg is > piby4 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1 + cvttsd2si %xmm2,%eax # npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # piby2_2 + cvtsi2sd %eax,%xmm2 # npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm3 # npi2 * piby2_1; + subsd %xmm3,%xmm6 # rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm8 # piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm1 # rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm1,%xmm6 # rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm8 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm1 # (rtail-(t-rhead)) + addsd %xmm8,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store lower region + movsd %xmm6,%xmm0 + subsd %xmm1,%xmm0 # r=(rhead-rtail) + subsd %xmm0,%xmm6 # rr=rhead-r + subsd %xmm1,%xmm6 # rr=((rhead-r) -rtail) + movlpd %xmm0,r(%rsp) # 
store lower r + movlpd %xmm6,rr(%rsp) # store lower rr + +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign +.align 16 +.Lcheck_upper_arg: + mov $0x07ff0000000000000,%r9 # is upper arg nan/inf + mov %r9,%r10 + and %rcx,%r10 + cmp %r9,%r10 + jz .L__vrd2_cos_upper_naninf + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + mov %r11,p_temp(%rsp) #Save Sign + call __amd_remainder_piby2@PLT + mov p_temp(%rsp),%r11 #Restore Sign + + jmp .L__vrd2_cos_reconstruct + +.L__vrd2_cos_upper_naninf: + mov p_original+8(%rsp),%rcx # upper arg is nan/inf + mov $0x00008000000000000,%r9 + or %r9,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) # rr = 0 + mov %r10d,region+4(%rsp) # region =0 + and .L__real_naninf_upper_sign_mask(%rip),%r11 # Sign + jmp .L__vrd2_cos_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 + + movhpd %xmm0, p_temp2(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r9 #is lower arg nan/inf + mov %r9,%r10 + and %rax,%r10 + cmp %r9,%r10 + jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5 + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r11,p_temp1(%rsp) #Save Sign + call __amd_remainder_piby2@PLT + mov p_temp1(%rsp),%r11 #Restore Sign + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + mov p_original(%rsp),%rax + mov $0x00008000000000000,%r9 + or %r9,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) #rr = 0 + mov %r10d,region(%rsp) #region = 0 + and .L__real_naninf_lower_sign_mask(%rip),%r11 # Sign + +.align 16 +0: + mov $0x07ff0000000000000,%r9 #is upper arg nan/inf + mov %r9,%r10 + and %rcx,%r10 + cmp %r9,%r10 + jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5 + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd p_temp2(%rsp), %xmm0 #Restore upper fp arg for remainder_piby2 call + mov %r11,p_temp(%rsp) #Save Sign + call __amd_remainder_piby2@PLT + mov p_temp(%rsp),%r11 #Restore Sign + + jmp 0f + +.L__vrd2_cos_upper_naninf_of_both_gt_5e5: + mov p_original+8(%rsp),%rcx #upper arg is nan/inf + mov $0x00008000000000000,%r9 + or %r9,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) #rr = 0 + mov %r10d,region+4(%rsp) #region = 0 + and .L__real_naninf_upper_sign_mask(%rip),%r11 # Sign + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +0: +.L__vrd2_cos_reconstruct: +#Construct p_sign=Sign for Sin term, p_sign1=Sign for Cos term, xmm0 = r, xmm2 = %xmm6,%r2 =rr, r8=region + movapd r(%rsp),%xmm0 #x + movapd %xmm0,%xmm2 #move for x2 + mulpd %xmm2,%xmm2 #x2 + movapd rr(%rsp),%xmm6 #xx + + mov region(%rsp),%r8 + mov .L__reald_one_zero(%rip),%r9 #compare value for cossin path + mov %r8,%r10 + mov %r8,%rax + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + mov %r10,%rcx + not %r11 #ADDED TO CHANGE THE LOGIC + and %r11,%r10 + not %rcx + not %r11 + and %r11,%rcx + or %rcx,%r10 + 
and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + + mov %r10,%r11 + and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $31,%r11 #shift upper sign bit left by 31 bits + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r11,p_sign+8(%rsp) #write out upper sign bit + + add .L__reald_one_one(%rip),%rax + and .L__reald_two_two(%rip),%rax + shr $1,%rax + + mov %rax,%rdx + and %r9,%rdx #mask out the lower sign bit leaving the upper sign bit + shl $63,%rax #shift lower sign bit left by 63 bits + shl $31,%rdx #shift upper sign bit left by 31 bits + mov %rax,p_sign1(%rsp) #write out lower sign bit + mov %rdx,p_sign1+8(%rsp) #write out upper sign bit + + + jmp .L__vrd2_sincos_approximate + + +#ENDMAIN + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd2_sincos_cleanup: + + xorpd p_sign(%rsp),%xmm5 # SIN sign + xorpd p_sign1(%rsp),%xmm4 # COS sign + + mov p_sin(%rsp),%rdi + mov p_cos(%rsp),%rsi + + movapd %xmm5,(%rdi) # save the sin + movapd %xmm4,(%rsi) # save the cos + +.Lfinal_check: + add $0x1C8,%rsp + ret +
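+
+# For reference, the region/sign bookkeeping above follows from the
+# quadrant identities for sin(r + n*pi/2) and cos(r + n*pi/2), with n
+# the region from argument reduction, together with sin being odd and
+# cos even. A scalar C sketch of the selection logic (a reading aid
+# only -- kernel_sin/kernel_cos are illustrative names, not library
+# functions; s is the sign bit of the original argument):
+#
+#   int swap  =  n & 1;              /* odd region: sin and cos swap    */
+#   int s_sin = ((n >> 1) ^ s) & 1;  /* the ~AB+A~B trick, i.e. A XOR B */
+#   int s_cos = ((n + 1) >> 1) & 1;  /* cos is even: input sign drops   */
+#
+#   double sv = swap ? kernel_cos(r, rr) : kernel_sin(r, rr);
+#   double cv = swap ? kernel_sin(r, rr) : kernel_cos(r, rr);
+#   *psin = s_sin ? -sv : sv;
+#   *pcos = s_cos ? -cv : cv;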
diff --git a/src/gas/vrd4cos.S b/src/gas/vrd4cos.S
new file mode 100644
index 0000000..5ecc97c
--- /dev/null
+++ b/src/gas/vrd4cos.S
@@ -0,0 +1,2987 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# vrd4cos.s
+#
+# A vector implementation of the cos libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_cos(__m128d x1, __m128d x2);
+#
+# Computes cosine for four double-precision values at a time.
+# Does not perform error checking, and denormal inputs may produce
+# unexpected results.
+# The four input values are passed as packed doubles in xmm0 and xmm1,
+# and the four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C prototype) currently allows returning two values
+# from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory and retrieving
+# the results from memory; this routine eliminates that overhead when the
+# data does not already reside in memory.
+# This routine is derived directly from the array version.
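+#
+# As a reading aid, the per-lane computation (for |x| < 5e5; larger
+# arguments are reduced via __amd_remainder_piby2 instead) is roughly
+# the following scalar C. This is a sketch, not library code: the
+# hex-float constants are exact translations of the .data entries
+# below, and sin_piby4/cos_piby4 stand for the polynomial kernels
+# sketched near the jump-table targets further down.
+#
+#   static const double twobypi     = 0x1.45f306dc9c883p-1;  /* 2/pi        */
+#   static const double piby2_1     = 0x1.921fb544p+0;       /* pi/2 head   */
+#   static const double piby2_2     = 0x1.0b4611a6p-34;      /* pi/2 middle */
+#   static const double piby2_2tail = 0x1.3198a2e037073p-69; /* pi/2 tail   */
+#
+#   static double cos_reference(double x)  /* assumes x = |x|; cos is even */
+#   {
+#       int    npi2  = (int)(x * twobypi + 0.5);
+#       double rhead = x - npi2 * piby2_1;
+#       double t     = rhead;
+#       double rtail = npi2 * piby2_2;
+#       rhead = t - rtail;
+#       rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#       double r  = rhead - rtail;           /* reduced argument         */
+#       double rr = (rhead - r) - rtail;     /* its extra-precision tail */
+#
+#       switch (npi2 & 3) {                  /* the "region"             */
+#       case 0:  return  cos_piby4(r, rr);
+#       case 1:  return -sin_piby4(r, rr);
+#       case 2:  return -cos_piby4(r, rr);
+#       default: return  sin_piby4(r, rr);
+#       }
+#   }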
+# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + + +.data +.align 16 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03fa5555555555555 + .quad 0x0bf56c16c16c16967 # -0.00138889 c2 + .quad 0x0bf56c16c16c16967 + .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3 + .quad 0x03efa01a019f4ec90 + .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4 + .quad 0x0be927e4fa17f65f6 + .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5 + .quad 0x03e21eeb69037ab78 + .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6 + .quad 0x0bda907db46cc5e42 +.Lsinarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bfc5555555555555 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03f81111111110bb3 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0bf2a01a019e83e5c + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03ec71de3796cde01 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0be5ae600b42fdfa7 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x03de5e0b2f9a43bb8 +.Lsincosarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x0bf56c16c16c16967 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x03efa01a019f4ec90 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x0be927e4fa17f65f6 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x03e21eeb69037ab78 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x0bda907db46cc5e42 +.Lcossinarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bf56c16c16c16967 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 
0x03efa01a019f4ec90 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0be927e4fa17f65f6 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03e21eeb69037ab78 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0bda907db46cc5e42 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + +.align 16 +.Levencos_oddsin_tbl: + .quad .Lcoscos_coscos_piby4 # 0 * + .quad .Lcoscos_cossin_piby4 # 1 + + .quad .Lcoscos_sincos_piby4 # 2 + .quad .Lcoscos_sinsin_piby4 # 3 + + + .quad .Lcossin_coscos_piby4 # 4 + .quad .Lcossin_cossin_piby4 # 5 * + .quad .Lcossin_sincos_piby4 # 6 + .quad .Lcossin_sinsin_piby4 # 7 + + .quad .Lsincos_coscos_piby4 # 8 + .quad .Lsincos_cossin_piby4 # 9 + .quad .Lsincos_sincos_piby4 # 10 * + .quad .Lsincos_sinsin_piby4 # 11 + + .quad .Lsinsin_coscos_piby4 # 12 + .quad .Lsinsin_cossin_piby4 # 13 + + .quad .Lsinsin_sincos_piby4 # 14 + .quad .Lsinsin_sinsin_piby4 # 15 * + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .text + .align 16 + .p2align 4,,15 + +# define local variable storage offsets +.equ p_temp, 0x00 # temporary for get/put bits operation +.equ p_temp1, 0x10 # temporary for get/put bits operation + +.equ p_xmm6, 0x20 # temporary for get/put bits operation +.equ p_xmm7, 0x30 # temporary for get/put bits operation +.equ p_xmm8, 0x40 # temporary for get/put bits operation +.equ p_xmm9, 0x50 # temporary for get/put bits operation +.equ p_xmm10, 0x60 # temporary for get/put bits operation +.equ p_xmm11, 0x70 # temporary for get/put bits operation +.equ p_xmm12, 0x80 # temporary for get/put bits operation +.equ p_xmm13, 0x90 # temporary for get/put bits operation +.equ p_xmm14, 0x0A0 # temporary for get/put bits operation +.equ p_xmm15, 0x0B0 # temporary for get/put bits operation + +.equ r, 0x0C0 # pointer to r for remainder_piby2 +.equ rr, 0x0D0 # pointer to r for remainder_piby2 +.equ region, 0x0E0 # pointer to r for remainder_piby2 + +.equ r1, 0x0F0 # pointer to r for remainder_piby2 +.equ rr1, 0x0100 # pointer to r for remainder_piby2 +.equ region1, 0x0110 # pointer to r for remainder_piby2 + +.equ p_temp2, 0x0120 # temporary for get/put bits operation +.equ p_temp3, 0x0130 # temporary for get/put bits operation + +.equ p_temp4, 0x0140 # temporary for get/put bits operation +.equ p_temp5, 0x0150 # temporary for get/put bits operation + +.equ p_original, 0x0160 # original x +.equ p_mask, 0x0170 # original x +.equ p_sign, 0x0180 # original x + +.equ p_original1, 0x0190 # original x +.equ p_mask1, 0x01A0 # original x +.equ p_sign1, 0x01B0 # original x + +.globl __vrd4_cos + .type __vrd4_cos,@function +__vrd4_cos: + sub $0x1C8,%rsp + +#DEBUG +# add $0x1C8,%rsp +# ret +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + +movapd .L__real_7fffffffffffffff(%rip),%xmm2 +movdqa %xmm0, p_original(%rsp) +movdqa %xmm1, p_original1(%rsp) + +andpd %xmm2,%xmm0 #Unsign +andpd %xmm2,%xmm1 #Unsign + +movd %xmm0,%rax #rax is lower arg +movhpd %xmm0, p_temp+8(%rsp) # +mov p_temp+8(%rsp),%rcx #rcx = upper arg +movd %xmm1,%r8 #rax is lower arg +movhpd %xmm1, p_temp1+8(%rsp) # +mov p_temp1+8(%rsp),%r9 #rcx = upper arg + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + +movapd %xmm0,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm0,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + +#DEBUG +# 
add $0x1C8,%rsp +# ret +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm0 + mulpd %xmm0,%xmm2 # * twobypi + mulpd %xmm0,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. + +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%rax # Region + movd %xmm5,%rcx # Region + + mov %rax,%r8 + mov %rcx,%r9 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail + +# paddd .L__reald_one_one(%rip),%xmm4 ; Sign +# paddd .L__reald_one_one(%rip),%xmm5 ; Sign +# pand .L__reald_two_two(%rip),%xmm4 +# pand .L__reald_two_two(%rip),%xmm5 +# punpckldq %xmm4,%xmm4 +# punpckldq %xmm5,%xmm5 +# psllq $62,%xmm4 +# psllq $62,%xmm5 + + + add .L__reald_one_one(%rip),%r8 + add .L__reald_one_one(%rip),%r9 + and .L__reald_two_two(%rip),%r8 + and .L__reald_two_two(%rip),%r9 + + mov %r8,%r10 + mov %r9,%r11 + shl $62,%r8 + and .L__reald_two_zero(%rip),%r10 + shl $30,%r10 + shl $62,%r9 + and .L__reald_two_zero(%rip),%r11 + shl $30,%r11 + + mov %r8,p_sign(%rsp) + mov %r10,p_sign+8(%rsp) + mov %r9,p_sign1(%rsp) + mov %r11,p_sign1+8(%rsp) + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm0,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + and .L__reald_one_one(%rip),%rax # Region + and .L__reald_one_one(%rip),%rcx # 
Region + + subpd %xmm8,%xmm0 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + subpd %xmm0,%xmm6 #rr=rhead-r + subpd %xmm1,%xmm7 #rr=rhead-r + + mov %rax,%r8 + mov %rcx,%r9 + + movapd %xmm0,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail + subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail + + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + leaq .Levencos_oddsin_tbl(%rip),%rsi + jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + +#DEBUG +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm10, xmm12 +# %xmm11,,%xmm9 xmm13 + + +#DEBUG +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm10 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm0 + subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm6 # rr=rhead-r + subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm0,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and 
%rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_lower_naninf + + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrd4_cos_lower_naninf: + mov p_original(%rsp),%rax # upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) # rr = 0 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + + +#DEBUG +# movapd .LOWORD,%xmm4 PTR r[rsp] +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + +#DEBUG +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + + + movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%rcx #Restore upper arg + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrd4_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + mov p_original(%rsp),%rax + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) #rr = 0 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrd4_cos_upper_naninf_of_both_gt_5e5: + mov p_original+8(%rsp),%rcx #upper arg is nan/inf +# movd %xmm6,%rcx ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) #rr = 0 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# 
%rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Restore xmm4 and %xmm3,,%xmm1 xmm7 +# Can use %xmm10,,%xmm8 xmm12 +# %xmm9,,%xmm5 xmm11, xmm13 + + movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call +# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoide exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case +# movlhps %xmm2,%xmm2 +# movlhps %xmm6,%xmm6 + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm10 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store upper region + movsd %xmm6,%xmm0 + subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm6 # rr=rhead-r + subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm0,r(%rsp) # store upper r + movlpd %xmm6,rr(%rsp) # store upper rr + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_upper_naninf + + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrd4_cos_upper_naninf: + mov p_original+8(%rsp),%rcx # upper arg is nan/inf +# mov r+8(%rsp),%rcx ; upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) # rr = 0 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcheck_next2_args: + +#DEBUG +# movapd r(%rsp),%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r8 + jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfirst_second_done_fourth_arg_gt_5e5 + +# Work on next two args, 
both < 5e5 +# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5 + + movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. + +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm5,region1(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm1,%xmm7 # rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + + subpd %xmm1,%xmm7 # rr=rhead-r + subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail + movapd %xmm7,rr1(%rsp) + + jmp .L__vrd4_cos_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm10, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + +#DEBUG +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm4,region(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm0,%xmm6 # rhead + subpd %xmm8,%xmm0 # r = rhead - rtail + movapd %xmm0,r(%rsp) + + subpd %xmm0,%xmm6 # rr=rhead-r + subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail + movapd %xmm6,rr(%rsp) + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + +#DEBUG +# movapd r(%rsp),%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + movsd %xmm7,%xmm1 + subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail) + subsd %xmm1,%xmm7 # rr=rhead-r + subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail) + movlpd %xmm1,r1+8(%rsp) # store upper r + movlpd %xmm7,rr1+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg 
nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_cos_lower_naninf_higher: + mov p_original1(%rsp),%r8 # upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) # rr = 0 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + + +#DEBUG +# movapd rr(%rsp),%xmm4 +# movapd rr1(%rsp),%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + jmp .L__vrd4_cos_reconstruct + + + + + + + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + +#DEBUG +# movapd r(%rsp),%xmm4 +# movd %r8,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movsd %xmm1,%xmm0 + call __amd_remainder_piby2@PLT + mov p_temp1(%rsp),%r9 #Restore upper arg + + +#DEBUG +# movapd r(%rsp),%xmm4 +# mov QWORD PTR r1[rsp+8], r9 +# movapd r1(%rsp),%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + + jmp 0f + +.L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov p_original1(%rsp),%r8 + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) #rr = 0 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr1+8(%rsp),%rsi + lea r1+8(%rsp),%rdi + movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher: + mov p_original1+8(%rsp),%r9 #upper arg is nan/inf +# movd %xmm6,%r9 ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) #rr = 0 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + +#DEBUG +# movapd r(%rsp),%xmm4 +# movapd r1(%rsp),%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + jmp .L__vrd4_cos_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm4,region(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm0,%xmm6 # rhead + subpd %xmm8,%xmm0 # r = rhead - rtail + movapd %xmm0,r(%rsp) + + subpd %xmm0,%xmm6 # rr=rhead-r + subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail + movapd %xmm6,rr(%rsp) + + +# Work on next two args, third arg < 5e5, fourth arg >= 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_fourth_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call +# movlhps %xmm1,%xmm1 ;Not needed since we want to work on lower arg, but done just to be safe and avoide exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case +# movlhps %xmm3,%xmm3 +# movlhps %xmm7,%xmm7 + movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r8d,region1(%rsp) # store lower region + movsd %xmm7,%xmm1 + subsd %xmm0,%xmm1 # xmm0 = r=(rhead-rtail) + subsd %xmm1,%xmm7 # rr=rhead-r + subsd %xmm0,%xmm7 # xmm6 = rr=((rhead-r) -rtail) + + movlpd %xmm1,r1(%rsp) # store upper r + movlpd %xmm7,rr1(%rsp) # store upper rr + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_upper_naninf_higher + + lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea rr1+8(%rsp),%rsi + lea r1+8(%rsp),%rdi + movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + 
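+
+# A note on the fallback used throughout: judging from the register
+# setup at each call site (xmm0 = argument; rdi, rsi, rdx = the three
+# destination pointers under the SysV AMD64 ABI), the helper's C
+# signature would be as sketched below. This is an inference from the
+# call sites, not a documented prototype. The NaN/Inf paths above OR
+# in the quiet bit instead of calling the helper:
+#
+#   #include <math.h>
+#
+#   extern void __amd_remainder_piby2(double x, double *r, double *rr,
+#                                     int *region);
+#
+#   static void reduce_huge(double x, double *r, double *rr, int *region)
+#   {
+#       if (isnan(x) || isinf(x)) {
+#           *r = x - x;  /* NaN; loosely mirrors r = x | 0x0008000000000000,
+#                           though the stored payload differs */
+#           *rr = 0.0;
+#           *region = 0;
+#       } else {
+#           __amd_remainder_piby2(x, r, rr, region);
+#       }
+#   }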
+.L__vrd4_cos_upper_naninf_higher: + mov p_original1+8(%rsp),%r9 # upper arg is nan/inf +# mov r1+8(%rsp),%r9 # upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) # rr = 0 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrd4_cos_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_cos_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +#DEBUG +# movapd region(%rsp),%xmm4 +# movapd region1(%rsp),%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + movapd r(%rsp),%xmm0 + movapd r1(%rsp),%xmm1 + + movapd rr(%rsp),%xmm6 + movapd rr1(%rsp),%xmm7 + + mov region(%rsp),%rax + mov region1(%rsp),%rcx + + mov %rax,%r8 + mov %rcx,%r9 + + add .L__reald_one_one(%rip),%r8 + add .L__reald_one_one(%rip),%r9 + and .L__reald_two_two(%rip),%r8 + and .L__reald_two_two(%rip),%r9 + + mov %r8,%r10 + mov %r9,%r11 + shl $62,%r8 + and .L__reald_two_zero(%rip),%r10 + shl $30,%r10 + shl $62,%r9 + and .L__reald_two_zero(%rip),%r11 + shl $30,%r11 + + mov %r8,p_sign(%rsp) + mov %r10,p_sign+8(%rsp) + mov %r9,p_sign1(%rsp) + mov %r11,p_sign1+8(%rsp) + + and .L__reald_one_one(%rip),%rax # Region + and .L__reald_one_one(%rip),%rcx # Region + + mov %rax,%r8 + mov %rcx,%r9 + + movapd %xmm0,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + + +#DEBUG +# movd %rax,%xmm4 +# movd %rax,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + leaq .Levencos_oddsin_tbl(%rip),%rsi + jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_cos_cleanup: + + movapd p_sign(%rsp),%xmm0 + movapd p_sign1(%rsp),%xmm1 + + xorpd %xmm4,%xmm0 # (+) Sign + xorpd %xmm5,%xmm1 # (+) Sign + + add $0x1C8,%rsp + ret + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + 
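+
+# The 16-entry .Levencos_oddsin_tbl is indexed by the parity of each
+# lane's region: an even region needs the cosine polynomial, an odd
+# region the sine polynomial. The bit fiddling in the reconstruct code
+# packs one parity bit per lane before the indirect jump. In C terms
+# (kernel_fn and the array name are illustrative stand-ins for the
+# .quad table defined above):
+#
+#   typedef void (*kernel_fn)(void);
+#   extern kernel_fn evencos_oddsin_tbl[16];
+#
+#   static void dispatch(const int region[4]) /* lane 0 = low of xmm0 */
+#   {
+#       unsigned idx = ((region[3] & 1) << 3) | ((region[2] & 1) << 2)
+#                    | ((region[1] & 1) << 1) |  (region[0] & 1);
+#       evencos_oddsin_tbl[idx]();            /* jmp *(%rsi,%rax,8)   */
+#   }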
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm2 # x4 recalculate + mulpd %xmm3,%xmm3 # x4 recalculate + + movapd p_temp2(%rsp),%xmm12 # r + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + + subpd %xmm12,%xmm4 # + t + subpd %xmm13,%xmm5 # + t + + jmp .L__vrd4_cos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcossin_cossin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # s3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + 
mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm10,%xmm10 # move high x4 for cos term + movhlps %xmm11,%xmm11 # move high x4 for cos term + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd %xmm0,%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos + movhlps %xmm3,%xmm13 # move high r for cos + movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos + movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm7,%xmm5 # sin *x3 + mulsd %xmm10,%xmm8 # cos *x4 + mulsd %xmm11,%xmm9 # cos *x4 + + mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term + mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term + movsd %xmm12,%xmm6 # Keep high r for cos term + movsd %xmm13,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0 + + subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx + subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x*xx for cos term + movhlps %xmm1,%xmm11 # move high x for x*xx for cos term + + mulsd p_temp+8(%rsp),%xmm10 # x * xx + mulsd p_temp1+8(%rsp),%xmm11 # x * xx + + movsd %xmm12,%xmm2 # move -t for cos term + movsd %xmm13,%xmm3 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t) + addsd p_temp(%rsp),%xmm4 # sin+xx + addsd p_temp1(%rsp),%xmm5 # sin+xx + subsd %xmm6,%xmm12 # (1-t) - r + subsd %xmm7,%xmm13 # (1-t) - r + subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx + subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx + addsd %xmm0,%xmm4 # sin + x + addsd %xmm1,%xmm5 # sin + x + addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx) + addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx) + subsd %xmm2,%xmm8 # cos+t + subsd %xmm3,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_cos_cleanup + +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lsincos_cossin_piby4: # changed from sincos_sincos + # xmm1 is cossin and xmm0 is sincos + + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # 
Store rr + movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term + + movapd .Lsincosarray+0x50(%rip),%xmm4 # s6 + movapd .Lcossinarray+0x50(%rip),%xmm5 # s6 + movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3 + movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm10,%xmm10 # move high x4 for cos term + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin) + movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos) + + mulsd %xmm0,%xmm6 # get low x3 for sin term + mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos (cossin) + movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos) + + movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin) + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos) + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm11,%xmm5 # cos *x4 + mulsd %xmm10,%xmm8 # cos *x4 + mulsd %xmm7,%xmm9 # sin *x3 + + mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin) + mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos) + + movsd %xmm12,%xmm6 # Keep high r for cos term + movsd %xmm3,%xmm7 # Keep low r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin) + subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos) + + movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin) + movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos) + + mulsd p_temp+8(%rsp),%xmm10 # x * xx + mulsd p_temp1(%rsp),%xmm1 # x * xx + + movsd %xmm12,%xmm2 # move -t for cos term + movsd %xmm3,%xmm13 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t) + + addsd p_temp(%rsp),%xmm4 # sin+xx + + addsd p_temp1+8(%rsp),%xmm9 # sin+xx + + + subsd %xmm6,%xmm12 # (1-t) - r + subsd %xmm7,%xmm3 # (1-t) - r + + subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx + subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx + + addsd %xmm0,%xmm4 # sin + x + + addsd %xmm11,%xmm9 # sin + x + + + addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx) + addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx) + + subsd %xmm2,%xmm8 # cos+t + subsd %xmm13,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrd4_cos_cleanup 
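+
+# All of the jump-table targets, including the one just above, evaluate
+# the same two polynomial kernels, split into low/high Horner halves so
+# the two dependency chains can run in parallel. A scalar sketch using
+# the rounded decimal coefficients quoted in the .Lcosarray/.Lsinarray
+# comments (the authoritative bit patterns are the .quad entries
+# themselves); x is the reduced argument, xx its tail:
+#
+#   static const double s1 = -0.166667,    s2 = 0.00833333,
+#                       s3 = -0.000198413, s4 = 2.75573e-06,
+#                       s5 = -2.50511e-08, s6 = 1.59181e-10;
+#   static const double c1 = 0.0416667,    c2 = -0.00138889,
+#                       c3 = 2.48016e-05,  c4 = -2.75573e-07,
+#                       c5 = 2.08761e-09,  c6 = -1.13826e-11;
+#
+#   static double sin_piby4(double x, double xx)
+#   {
+#       double x2 = x * x, x3 = x2 * x, x6 = x3 * x3;
+#       double zs = (s1 + x2 * (s2 + x2 * s3))
+#                 + x6 * (s4 + x2 * (s5 + x2 * s6));
+#       return x + (xx + (x3 * zs - 0.5 * x2 * xx));
+#   }
+#
+#   static double cos_piby4(double x, double xx)
+#   {
+#       double x2 = x * x, x4 = x2 * x2, x6 = x4 * x2;
+#       double zc = (c1 + x2 * (c2 + x2 * c3))
+#                 + x6 * (c4 + x2 * (c5 + x2 * c6));
+#       double r = 0.5 * x2;
+#       double t = 1.0 - r;                  /* head of 1 - x2/2      */
+#       double e = (1.0 - t) - r;            /* its rounding error    */
+#       return t + (x4 * zc + (e - x * xx)); /* "t trick" used above  */
+#   }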
+ +.align 16 +.Lsincos_sincos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm0,p_temp2(%rsp) # Store r + movapd %xmm1,p_temp3(%rsp) # Store r + + + movapd .Lcossinarray+0x50(%rip),%xmm4 # s6 + movapd .Lcossinarray+0x50(%rip),%xmm5 # s6 + movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3 + movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term + movhlps %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term + mulsd p_temp3+8(%rsp),%xmm7 # get low x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term + movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term + # Reverse 12 and 2 + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos + + mulsd %xmm6,%xmm8 # sin *x3 + mulsd %xmm7,%xmm9 # sin *x3 + mulsd %xmm10,%xmm4 # cos *x4 + mulsd %xmm11,%xmm5 # cos *x4 + + mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term + mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term + movsd %xmm2,%xmm6 # Keep high r for cos term + movsd %xmm3,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx + subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x for sin term + # Reverse 10 and 0 + + mulsd p_temp(%rsp),%xmm0 # x * xx + mulsd p_temp1(%rsp),%xmm1 # x * xx + + movsd %xmm2,%xmm12 # move -t for cos term + movsd %xmm3,%xmm13 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t) + addsd p_temp+8(%rsp),%xmm8 # sin+xx + addsd p_temp1+8(%rsp),%xmm9 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm7,%xmm3 # (1-t) - r + + subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx + subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx + + addsd 
%xmm10,%xmm8 # sin + x + addsd %xmm11,%xmm9 # sin + x + + addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx) + addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx) + + subsd %xmm12,%xmm4 # cos+t + subsd %xmm13,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lcossin_sincos_piby4: # changed from sincos_sincos + # xmm1 is cossin and xmm0 is sincos +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm0,p_temp2(%rsp) # Store r + + + movapd .Lcossinarray+0x50(%rip),%xmm4 # s6 + movapd .Lsincosarray+0x50(%rip),%xmm5 # s6 + movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3 + movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm11,%xmm11 # move high x4 for cos term + + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + + mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term + movhlps %xmm3,%xmm13 # move high r for cos + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin + + mulsd %xmm6,%xmm8 # sin *x3 + mulsd %xmm11,%xmm9 # cos *x4 + mulsd %xmm10,%xmm4 # cos *x4 + mulsd %xmm7,%xmm5 # sin *x3 + + mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term + mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term + + movsd %xmm2,%xmm6 # Keep high r for cos term + movsd %xmm13,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0 + + subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx + subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x*xx for cos term + + mulsd p_temp(%rsp),%xmm0 # x * xx + mulsd p_temp1+8(%rsp),%xmm11 # x * xx + + movsd %xmm2,%xmm12 # move -t for cos term + movsd %xmm13,%xmm3 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t) 
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t) + + addsd p_temp+8(%rsp),%xmm8 # sin+xx + addsd p_temp1(%rsp),%xmm5 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm7,%xmm13 # (1-t) - r + + subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx + subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx + + + addsd %xmm10,%xmm8 # sin + x + addsd %xmm1,%xmm5 # sin + x + + addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx) + addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx) + + subsd %xmm12,%xmm4 # cos+t + subsd %xmm3,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_cos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_sinsin_piby4: + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,p_temp2(%rsp) # store x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + movapd %xmm11,p_temp3(%rsp) # store r + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + mulpd %xmm2,%xmm10 # x4 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm0,%xmm2 # x3 recalculate + mulpd %xmm3,%xmm3 # x4 recalculate + + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm6,%xmm12 # 0.5 * x2 *xx + mulpd %xmm1,%xmm7 # x * xx + + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm12,%xmm4 # -0.5 * x2 *xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm6,%xmm4 # x3 * zs +xx + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + addpd %xmm0,%xmm4 # +x + subpd %xmm13,%xmm5 # + t + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lsinsin_coscos_piby4: + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + movapd %xmm3,p_temp3(%rsp) # store x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + movapd %xmm10,p_temp2(%rsp) # store r + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + mulpd %xmm3,%xmm11 # x4 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2 + + addpd 
.Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm2 # x4 recalculate + mulpd %xmm1,%xmm3 # x3 recalculate + + movapd p_temp2(%rsp),%xmm12 # r + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm7,%xmm13 # 0.5 * x2 *xx + subpd %xmm12,%xmm10 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x3 * zs + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;; + subpd %xmm13,%xmm5 # -0.5 * x2 *xx + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm7,%xmm5 # +xx + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + addpd %xmm1,%xmm5 # +x + subpd %xmm12,%xmm4 # + t + + jmp .L__vrd4_cos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_cossin_piby4: #Derive from cossin_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + movhlps %xmm10,%xmm10 # get upper r for t for cos + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + movsd %xmm0,%xmm8 # lower x for sin + mulsd %xmm2,%xmm8 # lower x3 for sin + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # upper x4 for cos + movsd %xmm8,%xmm2 # 
lower x3 for sin + + movsd %xmm6,%xmm9 # lower xx + # note using odd reg + + movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm0,%xmm6 # x * xx for upper cos term + mulpd %xmm1,%xmm7 # x * xx + movhlps %xmm6,%xmm6 + mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + # x3 * zs + + movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin + + subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp(%rsp),%xmm4 # +xx + + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subsd %xmm12,%xmm8 # + t + addsd %xmm0,%xmm4 # +x + subpd %xmm13,%xmm5 # + t + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lcoscos_sincos_piby4: #Derive from sincos_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lcossinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zszc + addpd %xmm9,%xmm5 # z + + mulpd %xmm0,%xmm2 # upper x3 for sin + mulsd %xmm0,%xmm2 # lower x4 for cos + mulpd %xmm3,%xmm3 # x4 + + movhlps %xmm6,%xmm9 # upper xx for sin term + # note using odd reg + + movlpd p_temp2(%rsp),%xmm12 # lower r for cos term + movapd p_temp3(%rsp),%xmm13 # r + + + mulpd %xmm0,%xmm6 # x * xx for lower cos term + mulpd %xmm1,%xmm7 # x * xx + + mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + mulpd %xmm3,%xmm5 + # x4 * zc + + movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos + subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx + + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + 
subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp+8(%rsp),%xmm8 # +xx + + movhlps %xmm0,%xmm0 # upper x for sin + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subsd %xmm12,%xmm4 # + t + subpd %xmm13,%xmm5 # + t + addsd %xmm0,%xmm8 # +x + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lcossin_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + movhlps %xmm11,%xmm11 # get upper r for t for cos + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zcs + + movsd %xmm1,%xmm9 # lower x for sin + mulsd %xmm3,%xmm9 # lower x3 for sin + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # upper x4 for cos + movsd %xmm9,%xmm3 # lower x3 for sin + + movsd %xmm7,%xmm8 # lower xx + # note using even reg + + movapd p_temp2(%rsp),%xmm12 # r + movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx for upper cos term + movhlps %xmm7,%xmm7 + mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + # x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin + + subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1(%rsp),%xmm5 # +xx + + + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + + subpd %xmm12,%xmm4 # + t + subsd %xmm13,%xmm9 # + t + addsd %xmm1,%xmm5 # +x + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_cos_cleanup 
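Every branch also evaluates its degree-6 coefficient polynomial the same way: not as one serial Horner chain but as two independent degree-2 chains joined by x6, which shortens the dependency chain and lets the two halves issue in parallel. A small C illustration of the identity, where s[] stands for whichever coefficient table the branch loads:

/* Straight Horner: one serial chain of six multiply-adds. */
double poly_serial(double x2, const double s[6])
{
    return s[0] + x2*(s[1] + x2*(s[2] + x2*(s[3] + x2*(s[4] + x2*s[5]))));
}

/* The split form used above: the s1+x2(s2+x2s3) and s4+x2(s5+x2s6)
   chains are independent, and x6 = x2*x2*x2 is computed alongside them,
   so an out-of-order core can overlap all three. */
double poly_split(double x2, const double s[6])
{
    double lo = s[0] + x2*(s[1] + x2*s[2]);
    double hi = s[3] + x2*(s[4] + x2*s[5]);
    double x4 = x2 * x2;
    double x6 = x4 * x2;
    return lo + x6 * hi;   /* z = (s1+x2(s2+x2s3)) + x6(s4+x2(s5+x2s6)) */
}

The two forms are algebraically identical; only the rounding pattern and the latency differ.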
+ +.align 16 +.Lcossin_sinsin_piby4: # Derived from sincos_sinsin + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + movhlps %xmm11,%xmm11 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm6,%xmm10 # 0.5*x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zczs + + movsd %xmm3,%xmm12 + mulsd %xmm1,%xmm12 # low x3 for sin + + mulpd %xmm0,%xmm2 # x3 + mulpd %xmm3,%xmm3 # high x4 for cos + movsd %xmm12,%xmm3 # low x3 for sin + + movhlps %xmm1,%xmm8 # upper x for cos term + # note using even reg + movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term + + mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term + + mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term + + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin + + subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx + + subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx + + subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx + addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1(%rsp),%xmm5 # +xx + + addpd %xmm6,%xmm4 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + + addsd %xmm1,%xmm5 # +x + addpd %xmm0,%xmm4 # +x + subsd %xmm13,%xmm9 # + t + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lsincos_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcossinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd 
.Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm7,%xmm8 # upper xx for sin term + # note using even reg + + movapd p_temp2(%rsp),%xmm12 # r + movlpd p_temp3(%rsp),%xmm13 # lower r for cos term + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx for lower cos term + + mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos + + subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1+8(%rsp),%xmm9 # +xx + + movhlps %xmm1,%xmm1 # upper x for sin + subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subpd %xmm12,%xmm4 # + t + subsd %xmm13,%xmm5 # + t + addsd %xmm1,%xmm9 # +x + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_cos_cleanup + + +.align 16 +.Lsincos_sinsin_piby4: # Derived from sincos_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcossinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm6,%xmm10 # 0.5x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd 
.Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm0,%xmm2 # x3 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm7,%xmm8 # upper xx for sin term + # note using even reg + + movlpd p_temp3(%rsp),%xmm13 # lower r for cos term + + mulpd %xmm1,%xmm7 # x * xx for lower cos term + + mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term + + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos + + subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx + + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx + addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1+8(%rsp),%xmm9 # +xx + + movhlps %xmm1,%xmm1 # upper x for sin + addpd %xmm6,%xmm4 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + addsd %xmm1,%xmm9 # +x + addpd %xmm0,%xmm4 # +x + subsd %xmm13,%xmm5 # + t + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_cos_cleanup + + +.align 16 +.Lsinsin_cossin_piby4: # Derived from sincos_sinsin + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # x2 + movapd %xmm6,p_temp(%rsp) # xx + + movhlps %xmm10,%xmm10 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm7,%xmm11 # 0.5*x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zs + + + movsd %xmm2,%xmm13 + mulsd %xmm0,%xmm13 # low x3 for sin + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm2,%xmm2 # high x4 for cos + movsd %xmm13,%xmm2 # low x3 for sin + + + movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg + movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term + mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term + mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term + subsd %xmm12,%xmm10 # (1 + (-t)) - r + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin + subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx + + 
subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx + + subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx + + addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp(%rsp),%xmm4 # +xx + + addpd %xmm7,%xmm5 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + + addsd %xmm0,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + subsd %xmm12,%xmm8 # + t + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lsinsin_sincos_piby4: # Derived from sincos_coscos + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcossinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm7,%xmm11 # 0.5x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm0,%xmm2 # upper x3 for sin + mulsd %xmm0,%xmm2 # lower x4 for cos + + movhlps %xmm6,%xmm9 # upper xx for sin term + # note using even reg + + movlpd p_temp2(%rsp),%xmm12 # lower r for cos term + + mulpd %xmm0,%xmm6 # x * xx for lower cos term + + mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm4,%xmm8 # xmm9= sin, xmm5= cos + + subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + + subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx + addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp+8(%rsp),%xmm8 # +xx + + movhlps %xmm0,%xmm0 # upper x for sin + addpd %xmm7,%xmm5 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + + + addsd %xmm0,%xmm8 # +x + addpd %xmm1,%xmm5 # +x + subsd %xmm12,%xmm4 # + t + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_cos_cleanup + + +.align 16 +.Lsinsin_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd 
.Lsinarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,p_temp2(%rsp) # copy of x2 + movapd %xmm3,p_temp3(%rsp) # copy of x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm6,%xmm2 # 0.5 * x2 *xx + mulpd %xmm7,%xmm3 # 0.5 * x2 *xx + + mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + movapd p_temp2(%rsp),%xmm10 # x2 + movapd p_temp3(%rsp),%xmm11 # x2 + + mulpd %xmm0,%xmm10 # x3 + mulpd %xmm1,%xmm11 # x3 + + mulpd %xmm10,%xmm4 # x3 * zs + mulpd %xmm11,%xmm5 # x3 * zs + + subpd %xmm2,%xmm4 # -0.5 * x2 *xx + subpd %xmm3,%xmm5 # -0.5 * x2 *xx + + addpd %xmm6,%xmm4 # +xx + addpd %xmm7,%xmm5 # +xx + + addpd %xmm0,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + + jmp .L__vrd4_cos_cleanup
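That completes the sixteen .L<a>_<b>_piby4 branches. They exist because, after each lane's argument is reduced modulo pi/2, each lane independently needs either the sin kernel or the cos kernel, and the code specializes every combination across the two lanes of xmm0 and the two lanes of xmm1. A hedged C sketch of the per-lane selection rule for this cos routine (sin_piby4/cos_piby4 are the kernels sketched earlier; the sign handling via p_sign masks is folded into the ternary here):

/* Reduced-range kernels as sketched earlier in this file. */
extern double sin_piby4(double r, double rr);
extern double cos_piby4(double r, double rr);

/* One lane of the cos path: n is the lane's nearest integer multiple of
   pi/2 removed during reduction, (r, rr) the head/tail remainder. */
double cos_of_reduced(double r, double rr, long n)
{
    double v = (n & 1) ? sin_piby4(r, rr) : cos_piby4(r, rr);
    return ((n + 1) & 2) ? -v : v;   /* quadrants 1 and 2 flip the sign */
}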
diff --git a/src/gas/vrd4exp.S b/src/gas/vrd4exp.S new file mode 100644 index 0000000..a05af8b --- /dev/null +++ b/src/gas/vrd4exp.S
@@ -0,0 +1,502 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4exp.S
+#
+# A vector implementation of the exp libm function.
+#
+# Prototype:
+#
+#     __m128d,__m128d __vrd4_exp(__m128d x1, __m128d x2);
+#
+# Computes e raised to the x power for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking. Denormal results are truncated to 0.
+#
+# This routine computes 4 double precision exponent values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 2 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ    p_temp,0        # temporary for get/put bits operation
+.equ    p_temp1,0x10    # temporary for exponent multiply
+
+.equ    save_rbx,0x020  #qword
+.equ    save_rdi,0x028  #qword
+
+.equ    save_rsi,0x030  #qword
+
+
+
+.equ    p2_temp,0x40    # second temporary for get/put bits operation
+.equ    p2_temp1,0x60   # second temporary for exponent multiply
+
+
+.equ    stack_size,0x088
+
+
+# parameters are passed in by Linux as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+    .text
+    .align 16
+    .p2align 4,,15
+.globl __vrd4_exp
+    .type   __vrd4_exp,@function
+__vrd4_exp:
+    sub     $stack_size,%rsp
+    mov     %rbx,save_rbx(%rsp)     # save rbx
+    movapd  %xmm1,%xmm6
+
+# process 4 values at a time.
+
+    movapd  .L__real_thirtytwo_by_log2(%rip),%xmm3  #
+
+# Step 1. Reduce the argument.
+#    /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+#    r = x * thirtytwo_by_logbaseof2;
+    movapd  %xmm3,%xmm7
+    movapd  %xmm0,p_temp(%rsp)
+    maxpd   .L__real_C0F0000000000000(%rip),%xmm0   # protect against very large negative, non-infinite numbers
+    mulpd   %xmm0,%xmm3
+
+    movapd  %xmm6,p2_temp(%rsp)
+    maxpd   .L__real_C0F0000000000000(%rip),%xmm6
+    mulpd   %xmm6,%xmm7
+
+# save x for later.
+ minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers + + +# /* Set n = nearest integer to r */ + cvtpd2dq %xmm3,%xmm4 + lea .L__two_to_jby32_lead_table(%rip),%rdi + lea .L__two_to_jby32_trail_table(%rip),%rsi + cvtdq2pd %xmm4,%xmm1 + minpd .L__real_40F0000000000000(%rip),%xmm7 # protect against very large, non-infinite numbers + + # r1 = x - n * logbaseof2_by_32_lead; + movapd .L__real_log2_by_32_lead(%rip),%xmm2 # + mulpd %xmm1,%xmm2 # + movq %xmm4,p_temp1(%rsp) + subpd %xmm2,%xmm0 # r1 in xmm0, + + cvtpd2dq %xmm7,%xmm2 + cvtdq2pd %xmm2,%xmm8 + +# r2 = - n * logbaseof2_by_32_trail; + mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1 +# j = n & 0x0000001f; + mov $0x01f,%r9 + mov %r9,%r8 + mov p_temp1(%rsp),%ecx + and %ecx,%r9d + movq %xmm2,p2_temp1(%rsp) + movapd .L__real_log2_by_32_lead(%rip),%xmm9 + mulpd %xmm8,%xmm9 + subpd %xmm9,%xmm6 # r1b in xmm6 + mulpd .L__real_log2_by_32_tail(%rip),%xmm8 # r2b in xmm8 + + mov p_temp1+4(%rsp),%edx + and %edx,%r8d +# f1 = two_to_jby32_lead_table[j]; +# f2 = two_to_jby32_trail_table[j]; + +# *m = (n - j) / 32; + sub %r9d,%ecx + sar $5,%ecx #m + sub %r8d,%edx + sar $5,%edx + + + movapd %xmm0,%xmm2 + addpd %xmm1,%xmm2 # r = r1 + r2 + + mov $0x01f,%r11 + mov %r11,%r10 + mov p2_temp1(%rsp),%ebx + and %ebx,%r11d +# Step 2. Compute the polynomial. +# q = r1 + (r2 + +# r*r*( 5.00000000000000008883e-01 + +# r*( 1.66666666665260878863e-01 + +# r*( 4.16666666662260795726e-02 + +# r*( 8.33336798434219616221e-03 + +# r*( 1.38889490863777199667e-03 )))))); +# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720 + movapd %xmm2,%xmm1 + movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720 + movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6 +# deal with infinite results + mov $1024,%rax + movsx %ecx,%rcx + cmp %rax,%rcx + + mulpd %xmm2,%xmm3 # *x + mulpd %xmm2,%xmm0 # *x + mulpd %xmm2,%xmm1 # x*x + movapd %xmm1,%xmm4 + + cmovg %rax,%rcx ## if infinite, then set rcx to multiply + # by infinity + movsx %edx,%rdx + cmp %rax,%rdx + + movapd %xmm6,%xmm9 + addpd %xmm8,%xmm9 # rb = r1b + r2b + addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120 + addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5 + mulpd %xmm1,%xmm4 # x^4 + mulpd %xmm2,%xmm3 # *x + + cmovg %rax,%rdx ## if infinite, then set rcx to multiply + # by infinity +# deal with denormal results + xor %rax,%rax + add $1023,%rcx # add bias + + mulpd %xmm1,%xmm0 # *x^2 + addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24 + addpd %xmm2,%xmm0 # + x + mulpd %xmm4,%xmm3 # *x^4 + +# check for infinity or nan + movapd p_temp(%rsp),%xmm2 + + cmovs %rax,%rcx ## if denormal, then multiply by 0 + shl $52,%rcx # build 2^n + + sub %r11d,%ebx + movapd %xmm9,%xmm1 + addpd %xmm3,%xmm0 # q = final sum + movapd .L__real_3f56c1728d739765(%rip),%xmm7 # 1/720 + movapd .L__real_3FC5555555548F7C(%rip),%xmm3 # 1/6 + +# *z2 = f2 + ((f1 + f2) * q); + movlpd (%rsi,%r9,8),%xmm5 # f2 + movlpd (%rsi,%r8,8),%xmm4 # f2 + addsd (%rdi,%r8,8),%xmm4 # f1 + f2 + addsd (%rdi,%r9,8),%xmm5 # f1 + f2 + mov p2_temp1+4(%rsp),%r8d + and %r8d,%r10d + sar $5,%ebx #m + mulpd %xmm9,%xmm7 # *x + mulpd %xmm9,%xmm3 # *x + mulpd %xmm9,%xmm1 # x*x + sub %r10d,%r8d + sar $5,%r8d +# check for infinity or nan + andpd .L__real_infinity(%rip),%xmm2 + cmppd $0,.L__real_infinity(%rip),%xmm2 + add $1023,%rdx # add bias + shufpd $0,%xmm4,%xmm5 + movapd %xmm1,%xmm4 + + cmovs %rax,%rdx ## if denormal, then multiply by 0 + shl $52,%rdx # build 2^n + + mulpd %xmm5,%xmm0 + mov %rcx,p_temp1(%rsp) # get 2^n to memory + mov 
%rdx,p_temp1+8(%rsp)    # get 2^n to memory
+    addpd   %xmm5,%xmm0     #z = z1 + z2
+    mov     $1024,%rax
+    movsx   %ebx,%rbx
+    cmp     %rax,%rbx
+# end of splitexp
+#        /* Scale (z1 + z2) by 2.0**m */
+#          r = scaleDouble_1(z, n);
+
+
+    cmovg   %rax,%rbx       ## if infinite, then set rbx to multiply
+                            # by infinity
+    movsx   %r8d,%rdx
+    cmp     %rax,%rdx
+
+    movmskpd    %xmm2,%r8d
+
+    addpd   .L__real_3F811115B7AA905E(%rip),%xmm7   # + 1/120
+    addpd   .L__real_3fe0000000000000(%rip),%xmm3   # + .5
+    mulpd   %xmm1,%xmm4     # x^4
+    mulpd   %xmm9,%xmm7     # *x
+    cmovg   %rax,%rdx       ## if infinite, then set rdx to multiply
+
+
+    xor     %rax,%rax
+    add     $1023,%rbx      # add bias
+
+    mulpd   %xmm1,%xmm3     # *x^2
+    addpd   .L__real_3FA5555555545D4E(%rip),%xmm7   # + 1/24
+    addpd   %xmm9,%xmm3     # + x
+    mulpd   %xmm4,%xmm7     # *x^4
+
+    cmovs   %rax,%rbx       ## if denormal, then multiply by 0
+    shl     $52,%rbx        # build 2^n
+
+# Step 3. Reconstitute.
+
+    mulpd   p_temp1(%rsp),%xmm0     # result *= 2^n
+    addpd   %xmm7,%xmm3     # q = final sum
+
+    movlpd  (%rsi,%r11,8),%xmm5     # f2
+    movlpd  (%rsi,%r10,8),%xmm4     # f2
+    addsd   (%rdi,%r10,8),%xmm4     # f1 + f2
+    addsd   (%rdi,%r11,8),%xmm5     # f1 + f2
+
+    add     $1023,%rdx      # add bias
+    cmovs   %rax,%rdx       ## if denormal, then multiply by 0
+    shufpd  $0,%xmm4,%xmm5
+    shl     $52,%rdx        # build 2^n
+
+    mulpd   %xmm5,%xmm3
+    mov     %rbx,p2_temp1(%rsp)     # get 2^n to memory
+    mov     %rdx,p2_temp1+8(%rsp)   # get 2^n to memory
+    addpd   %xmm5,%xmm3     #z = z1 + z2
+
+    movapd  p2_temp(%rsp),%xmm2
+    andpd   .L__real_infinity(%rip),%xmm2
+    cmppd   $0,.L__real_infinity(%rip),%xmm2
+    movmskpd    %xmm2,%ebx
+    test    $3,%r8d
+    mulpd   p2_temp1(%rsp),%xmm3    # result *= 2^n
+# we'd like to avoid a branch, and can use cmp's and and's to
+# eliminate them. But it adds cycles for normal cases which
+# are supposed to be exceptions. Using this branch with the
+# check above results in faster code for the normal cases.
+    jnz     .L__exp_naninf
+
+.L__vda_bottom1:
+# store the result __m128d
+    test    $3,%ebx
+    jnz     .L__exp_naninf2
+
+.L__vda_bottom2:
+
+    movapd  %xmm3,%xmm1
+
+
+#
+#
+.L__final_check:
+    mov     save_rbx(%rsp),%rbx     # restore rbx
+    add     $stack_size,%rsp
+    ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+    lea     p_temp(%rsp),%rcx
+    call    .L__naninf
+    jmp     .L__vda_bottom1
+.L__exp_naninf2:
+    lea     p2_temp(%rsp),%rcx
+    mov     %ebx,%r8d
+    movapd  %xmm0,%xmm4
+    movapd  %xmm3,%xmm0
+    call    .L__naninf
+    movapd  %xmm0,%xmm3
+    movapd  %xmm4,%xmm0
+    jmp     .L__vda_bottom2
+
+# This subroutine checks a double pair for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# r8d - mask of errors
+# xmm0 - computed result vector
+# rcx - pointing to memory image of inputs
+# Outputs:
+# xmm0 - new result vector
+# %rax, %rdx, %xmm2 all modified.
+.L__naninf: +# check the first number + test $1,%r8d + jz .L__check2 + + mov (%rcx),%rdx + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__enan1 # jump if mantissa not zero, so it's a NaN +# inf + mov %rdx,%rax + rcl $1,%rax + jnc .L__r1 # exp(+inf) = inf + xor %rdx,%rdx # exp(-inf) = 0 + jmp .L__r1 + +#NaN +.L__enan1: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__r1: + movd %rdx,%xmm2 + shufpd $2,%xmm0,%xmm2 + movsd %xmm2,%xmm0 +# check the second number +.L__check2: + test $2,%r8d + jz .L__r3 + mov 8(%rcx),%rdx + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__enan2 # jump if mantissa not zero, so it's a NaN +# inf + mov %rdx,%rax + rcl $1,%rax + jnc .L__r2 # exp(+inf) = inf + xor %rdx,%rdx # exp(-inf) = 0 + jmp .L__r2 + +#NaN +.L__enan2: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__r2: + movd %rdx,%xmm2 + shufpd $0,%xmm2,%xmm0 +.L__r3: + ret + + .data + .align 64 + +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_4040000000000000: .quad 0x04040000000000000 # 32 + .quad 0x04040000000000000 +.L__real_40F0000000000000: .quad 0x040F0000000000000 # 65536, to protect against really large numbers + .quad 0x040F0000000000000 +.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers + .quad 0x0C0F0000000000000 +.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32 + .quad 0x03FA0000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 +.L__real_infinity: .quad 0x07ff0000000000000 # + .quad 0x07ff0000000000000 +.L__real_ninfinity: .quad 0x0fff0000000000000 # + .quad 0x0fff0000000000000 +.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2 + .quad 0x040471547652b82fe +.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead + .quad 0x03f962e42fe000000 +.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail + .quad 0x0Bdcf473de6af278e +.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03 + .quad 0x03f56c1728d739765 +.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03 + .quad 0x03F811115B7AA905E +.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02 + .quad 0x03FA5555555545D4E +.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01 + .quad 0x03FC5555555548F7C + + +.L__two_to_jby32_lead_table: + .quad 0x03ff0000000000000 # 1 + .quad 0x03ff059b0d0000000 # 1.0219 + .quad 0x03ff0b55860000000 # 1.04427 + .quad 0x03ff11301d0000000 # 1.06714 + .quad 0x03ff172b830000000 # 1.09051 + .quad 0x03ff1d48730000000 # 1.11439 + .quad 0x03ff2387a60000000 # 1.13879 + .quad 0x03ff29e9df0000000 # 1.16372 + .quad 0x03ff306fe00000000 # 1.18921 + .quad 0x03ff371a730000000 # 1.21525 + .quad 0x03ff3dea640000000 # 1.24186 + .quad 0x03ff44e0860000000 # 1.26905 + .quad 0x03ff4bfdad0000000 # 1.29684 + .quad 0x03ff5342b50000000 # 1.32524 + .quad 0x03ff5ab07d0000000 # 1.35426 + .quad 0x03ff6247eb0000000 # 1.38391 + .quad 0x03ff6a09e60000000 # 1.41421 + .quad 0x03ff71f75e0000000 # 1.44518 + .quad 0x03ff7a11470000000 # 1.47683 + .quad 0x03ff8258990000000 # 1.50916 + .quad 0x03ff8ace540000000 # 1.54221 + .quad 0x03ff93737b0000000 # 1.57598 + .quad 0x03ff9c49180000000 # 1.61049 + .quad 0x03ffa5503b0000000 # 1.64576 + .quad 0x03ffae89f90000000 # 1.68179 + .quad 0x03ffb7f76f0000000 # 1.71862 + .quad 0x03ffc199bd0000000 # 1.75625 + .quad 0x03ffcb720d0000000 
# 1.79471 + .quad 0x03ffd5818d0000000 # 1.83401 + .quad 0x03ffdfc9730000000 # 1.87417 + .quad 0x03ffea4afa0000000 # 1.91521 + .quad 0x03fff507650000000 # 1.95714 + .quad 0 # for alignment +.L__two_to_jby32_trail_table: + .quad 0x00000000000000000 # 0 + .quad 0x03e48ac2ba1d73e2a # 1.1489e-008 + .quad 0x03e69f3121ec53172 # 4.83347e-008 + .quad 0x03df25b50a4ebbf1b # 2.67125e-010 + .quad 0x03e68faa2f5b9bef9 # 4.65271e-008 + .quad 0x03e368b9aa7805b80 # 5.24924e-009 + .quad 0x03e6ceac470cd83f6 # 5.38622e-008 + .quad 0x03e547f7b84b09745 # 1.90902e-008 + .quad 0x03e64636e2a5bd1ab # 3.79764e-008 + .quad 0x03e5ceaa72a9c5154 # 2.69307e-008 + .quad 0x03e682468446b6824 # 4.49684e-008 + .quad 0x03e18624b40c4dbd0 # 1.41933e-009 + .quad 0x03e54d8a89c750e5e # 1.94147e-008 + .quad 0x03e5a753e077c2a0f # 2.46409e-008 + .quad 0x03e6a90a852b19260 # 4.94813e-008 + .quad 0x03e0d2ac258f87d03 # 8.48872e-010 + .quad 0x03e59fcef32422cbf # 2.42032e-008 + .quad 0x03e61d8bee7ba46e2 # 3.3242e-008 + .quad 0x03e4f580c36bea881 # 1.45957e-008 + .quad 0x03e62999c25159f11 # 3.46453e-008 + .quad 0x03e415506dadd3e2a # 8.0709e-009 + .quad 0x03e29b8bc9e8a0388 # 2.99439e-009 + .quad 0x03e451f8480e3e236 # 9.83622e-009 + .quad 0x03e41f12ae45a1224 # 8.35492e-009 + .quad 0x03e62b5a75abd0e6a # 3.48493e-008 + .quad 0x03e47daf237553d84 # 1.11085e-008 + .quad 0x03e6b0aa538444196 # 5.03689e-008 + .quad 0x03e69df20d22a0798 # 4.81896e-008 + .quad 0x03e69f7490e4bb40b # 4.83654e-008 + .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008 + .quad 0x03e452486cc2c7b9d # 9.84533e-009 + .quad 0x03e66dc8a80ce9f09 # 4.25828e-008 + .quad 0 # for alignment + + +
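In scalar form, the algorithm __vrd4_exp interleaves for four lanes is short. The sketch below follows the comments in the code: n = nearest integer to x*32/ln2 splits into a table index j = n & 31 and a scale m = (n - j)/32, a degree-6 polynomial gives q ~ e^r - 1, and the result is 2^m * f * (1 + q) with f = 2^(j/32). It is only a model: exp2(j/32.0) stands in for the split f1/f2 tables, ln2/32 is used unsplit where the code keeps lead/tail parts, and the NaN/infinity/denormal paths are omitted.

#include <math.h>
#include <stdint.h>
#include <string.h>

double vrd4_exp_model(double x)
{
    const double ln2 = 0.69314718055994530942;
    /* Step 1: reduce.  The asm splits ln2/32 into lead and tail constants
       so r1 is computed exactly; a single constant is used here. */
    int n = (int)lrint(x * (32.0 / ln2));
    double r = x - n * (ln2 / 32.0);

    int j = n & 0x1f;           /* table index */
    int m = (n - j) >> 5;       /* power-of-two scale */

    /* Step 2: q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720 */
    double q = r + r*r*(0.5 + r*(1.0/6 + r*(1.0/24 + r*(1.0/120 + r*(1.0/720)))));

    /* Step 3: z2 = f2 + (f1 + f2)*q and z = z1 + z2, i.e. z = f*(1 + q);
       exp2(j/32.0) stands in for the lead/tail table pair. */
    double f = exp2(j / 32.0);
    double z = f + f * q;

    /* Scale by 2^m by building the exponent field directly (the shl $52). */
    uint64_t bits = (uint64_t)(m + 1023) << 52;
    double scale;
    memcpy(&scale, &bits, sizeof scale);
    return z * scale;
}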
diff --git a/src/gas/vrd4frcpa.S b/src/gas/vrd4frcpa.S new file mode 100644 index 0000000..3ae0b91 --- /dev/null +++ b/src/gas/vrd4frcpa.S
@@ -0,0 +1,1181 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4frcpa.S
+#
+# A vector implementation of the floating point reciprocal approximation function.
+# The goal is to be faster than a divide. This routine provides four double
+# precision results from four double precision inputs. It would not be necessary
+# if SSE defined a double precision instruction similar to the single precision
+# rcpss.
+#
+# Prototype:
+#
+#     __m128d,__m128d __vrd4_frcpa(__m128d x1, __m128d x2);
+#
+# Computes an approximate reciprocal of x.
+# A table lookup is performed on the higher 10 bits of the mantissa
+# (not including the implicit bit).
+#
+#
+#
+# This routine computes 4 double precision frcpa values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 2 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops.
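A hedged C model of the lookup described above: round the top 11 mantissa bits to a 10-bit table index, fetch a 12-bit reciprocal mantissa, and rebuild the exponent by subtracting the input's biased exponent from a constant, so the result's exponent is the negation of the input's (less one when the mantissa exceeds 1.0). The table initializer below only approximates how the shipped .L__rcp_table entries could have been generated; individual entries may differ in the last bit depending on rounding choices, and denormal, infinity and NaN inputs are not handled.

#include <stdint.h>
#include <string.h>

static uint32_t rcp_table[1024];

static void rcp_table_init(void)
{
    rcp_table[0] = 0;                      /* mantissa 1.0 -> reciprocal 1.0 */
    for (int j = 1; j < 1024; ++j) {
        double m = 1.0 + j / 1024.0;       /* mantissa this index stands for */
        /* 1/m in (0.5, 1) is 1.f * 2^-1; store the top 12 bits of f */
        rcp_table[j] = (uint32_t)((2.0 / m - 1.0) * 4096.0 + 0.5) & 0xfff;
    }
}

static double frcpa_model(double x)
{
    uint64_t b, r;
    memcpy(&b, &x, sizeof b);
    uint64_t sign = b & 0x8000000000000000ull;
    uint64_t v = b >> 41;                  /* sign | 11-bit exp | 11 mantissa bits */
    /* "if 1/2 bit set, increment the index": round to a 10-bit index */
    unsigned idx = (unsigned)(((v + 1) >> 1) & 0x3ff);
    /* invert the exponent: subtract the unrounded field from a constant */
    uint64_t e = (0x3ff000u - v) & 0x3ff800u;
    /* recombine: exponent lands in bits 52..62, mantissa in bits 40..51 */
    r = sign | (((e << 1) | rcp_table[idx]) << 40);
    memcpy(&x, &r, sizeof x);
    return x;
}

For example, with the table initialized, frcpa_model(1.5) indexes entry 512 (about 0x555) with result exponent 1022, giving roughly 1.3333 * 2^-1 = 0.6666, an 11-bit-accurate reciprocal.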
+# + + +# define local variable storage offsets +.equ p_x,0 # temporary for get/put bits operation +.equ p_x2,0x10 # temporary for get/put bits operation + +.equ stack_size,0x028 +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + +# parameters are expected as: +# xmm0 - __m128d x1 +# xmm1 - __m128d x2 + + .text + .align 16 + .p2align 4,,15 +.globl __vrd4_frcpa + .type __vrd4_frcpa,@function +__vrd4_frcpa: + sub $stack_size,%rsp +# 10 bit GPR method + xor %rax,%rax + movdqa .L__mask_expext(%rip),%xmm3 + movdqa %xmm1,%xmm6 + movdqa %xmm0,%xmm4 + movdqa %xmm3,%xmm5 +## if 1/2 bit set, increment the index+exponent + psrlq $41,%xmm4 + psrlq $41,%xmm6 + movdqa %xmm4,%xmm2 + paddq .L__int_one(%rip),%xmm4 + psrlq $1,%xmm4 + pand .L__mask_10bits(%rip),%xmm4 +# invert the exponent + psubq %xmm2,%xmm3 + movdqa %xmm6,%xmm2 + paddq .L__int_one(%rip),%xmm6 + psrlq $1,%xmm6 + pand .L__mask_10bits(%rip),%xmm6 + psubq %xmm2,%xmm5 + pand .L__mask_expext2(%rip),%xmm3 + pand .L__mask_expext2(%rip),%xmm5 + psllq $1,%xmm3 +# do the lookup and recombine + lea .L__rcp_table(%rip),%rdx + + movdqa %xmm4,p_x(%rsp) # move the indexes to a memory location + psllq $1,%xmm5 + mov p_x(%rsp),%r8 # 3 cycles faster for frcpa, but 2 cycles slower for log + mov p_x+8(%rsp),%r9 + movdqa %xmm6,p_x2(%rsp) # move the indexes to a memory location + movd (%rdx,%r9,4),%xmm2 # lookup + movd (%rdx,%r8,4),%xmm4 # lookup + pslldq $8,%xmm2 # shift by 8 bytes + por %xmm4,%xmm2 + por %xmm2,%xmm3 + mov p_x2(%rsp),%r8 # 3 cycles faster for frcpa, but 2 cycles slower for log + mov p_x2+8(%rsp),%r9 + movd (%rdx,%r9,4),%xmm2 # lookup + movd (%rdx,%r8,4),%xmm4 # lookup + pslldq $8,%xmm2 # shift by 8 bytes + por %xmm4,%xmm2 + por %xmm2,%xmm5 +# shift and restore the sign + pand .L__mask_sign(%rip),%xmm0 + pand .L__mask_sign(%rip),%xmm1 + psllq $40,%xmm3 + psllq $40,%xmm5 + por %xmm3,%xmm0 + por %xmm5,%xmm1 + add $stack_size,%rsp + ret + + + .data + .align 16 + +.L__int_one: .quad 0x00000000000000001 + .quad 0x00000000000000001 + +.L__mask_10bits: .quad 0x000000000000003ff + .quad 0x000000000000003ff + +.L__mask_expext: .quad 0x000000000003ff000 + .quad 0x000000000003ff000 + +.L__mask_expext2: .quad 0x000000000003ff800 + .quad 0x000000000003ff800 + +.L__mask_sign: .quad 0x08000000000000000 + .quad 0x08000000000000000 + +.L__real_one: .quad 0x03ff0000000000000 + .quad 0x03ff0000000000000 +.L__real_two: .quad 0x04000000000000000 + .quad 0x04000000000000000 + + .align 16 + +.L__rcp_table: + .long 0x0000 + .long 0x0FF8 + .long 0x0FF0 + .long 0x0FE8 + .long 0x0FE0 + .long 0x0FD8 + .long 0x0FD0 + .long 0x0FC8 + .long 0x0FC0 + .long 0x0FB8 + .long 0x0FB1 + .long 0x0FA9 + .long 0x0FA1 + .long 0x0F99 + .long 0x0F91 + .long 0x0F89 + .long 0x0F82 + .long 0x0F7A + .long 0x0F72 + .long 0x0F6B + .long 0x0F63 + .long 0x0F5B + .long 0x0F53 + .long 0x0F4C + .long 0x0F44 + .long 0x0F3D + .long 0x0F35 + .long 0x0F2D + .long 0x0F26 + .long 0x0F1E + .long 0x0F17 + .long 0x0F0F + .long 0x0F08 + .long 0x0F00 + .long 0x0EF8 + .long 0x0EF1 + .long 0x0EEA + .long 0x0EE2 + .long 0x0EDB + .long 0x0ED3 + .long 0x0ECC + .long 0x0EC4 + .long 0x0EBD + .long 0x0EB6 + .long 0x0EAE + .long 0x0EA7 + .long 0x0EA0 + .long 0x0E98 + .long 0x0E91 + .long 0x0E8A + .long 0x0E82 + .long 0x0E7B + .long 0x0E74 + .long 0x0E6D + .long 0x0E65 + .long 0x0E5E + .long 0x0E57 + .long 0x0E50 + .long 0x0E49 + .long 0x0E41 + .long 0x0E3A + .long 0x0E33 + .long 0x0E2C + .long 0x0E25 + .long 0x0E1E + .long 0x0E17 + .long 0x0E10 + .long 0x0E09 + .long 0x0E02 + .long 0x0DFB + .long 0x0DF4 + 
.long 0x0DED + .long 0x0DE6 + .long 0x0DDF + .long 0x0DD8 + .long 0x0DD1 + .long 0x0DCA + .long 0x0DC3 + .long 0x0DBC + .long 0x0DB5 + .long 0x0DAE + .long 0x0DA7 + .long 0x0DA0 + .long 0x0D9A + .long 0x0D93 + .long 0x0D8C + .long 0x0D85 + .long 0x0D7E + .long 0x0D77 + .long 0x0D71 + .long 0x0D6A + .long 0x0D63 + .long 0x0D5C + .long 0x0D56 + .long 0x0D4F + .long 0x0D48 + .long 0x0D42 + .long 0x0D3B + .long 0x0D34 + .long 0x0D2E + .long 0x0D27 + .long 0x0D20 + .long 0x0D1A + .long 0x0D13 + .long 0x0D0C + .long 0x0D06 + .long 0x0CFF + .long 0x0CF9 + .long 0x0CF2 + .long 0x0CEC + .long 0x0CE5 + .long 0x0CDF + .long 0x0CD8 + .long 0x0CD2 + .long 0x0CCB + .long 0x0CC5 + .long 0x0CBE + .long 0x0CB8 + .long 0x0CB1 + .long 0x0CAB + .long 0x0CA4 + .long 0x0C9E + .long 0x0C98 + .long 0x0C91 + .long 0x0C8B + .long 0x0C85 + .long 0x0C7E + .long 0x0C78 + .long 0x0C72 + .long 0x0C6B + .long 0x0C65 + .long 0x0C5F + .long 0x0C58 + .long 0x0C52 + .long 0x0C4C + .long 0x0C46 + .long 0x0C3F + .long 0x0C39 + .long 0x0C33 + .long 0x0C2D + .long 0x0C26 + .long 0x0C20 + .long 0x0C1A + .long 0x0C14 + .long 0x0C0E + .long 0x0C08 + .long 0x0C02 + .long 0x0BFB + .long 0x0BF5 + .long 0x0BEF + .long 0x0BE9 + .long 0x0BE3 + .long 0x0BDD + .long 0x0BD7 + .long 0x0BD1 + .long 0x0BCB + .long 0x0BC5 + .long 0x0BBF + .long 0x0BB9 + .long 0x0BB3 + .long 0x0BAD + .long 0x0BA7 + .long 0x0BA1 + .long 0x0B9B + .long 0x0B95 + .long 0x0B8F + .long 0x0B89 + .long 0x0B83 + .long 0x0B7D + .long 0x0B77 + .long 0x0B71 + .long 0x0B6C + .long 0x0B66 + .long 0x0B60 + .long 0x0B5A + .long 0x0B54 + .long 0x0B4E + .long 0x0B48 + .long 0x0B43 + .long 0x0B3D + .long 0x0B37 + .long 0x0B31 + .long 0x0B2B + .long 0x0B26 + .long 0x0B20 + .long 0x0B1A + .long 0x0B14 + .long 0x0B0F + .long 0x0B09 + .long 0x0B03 + .long 0x0AFE + .long 0x0AF8 + .long 0x0AF2 + .long 0x0AED + .long 0x0AE7 + .long 0x0AE1 + .long 0x0ADC + .long 0x0AD6 + .long 0x0AD0 + .long 0x0ACB + .long 0x0AC5 + .long 0x0AC0 + .long 0x0ABA + .long 0x0AB4 + .long 0x0AAF + .long 0x0AA9 + .long 0x0AA4 + .long 0x0A9E + .long 0x0A99 + .long 0x0A93 + .long 0x0A8E + .long 0x0A88 + .long 0x0A83 + .long 0x0A7D + .long 0x0A78 + .long 0x0A72 + .long 0x0A6D + .long 0x0A67 + .long 0x0A62 + .long 0x0A5C + .long 0x0A57 + .long 0x0A52 + .long 0x0A4C + .long 0x0A47 + .long 0x0A41 + .long 0x0A3C + .long 0x0A37 + .long 0x0A31 + .long 0x0A2C + .long 0x0A27 + .long 0x0A21 + .long 0x0A1C + .long 0x0A17 + .long 0x0A11 + .long 0x0A0C + .long 0x0A07 + .long 0x0A01 + .long 0x09FC + .long 0x09F7 + .long 0x09F2 + .long 0x09EC + .long 0x09E7 + .long 0x09E2 + .long 0x09DD + .long 0x09D7 + .long 0x09D2 + .long 0x09CD + .long 0x09C8 + .long 0x09C3 + .long 0x09BD + .long 0x09B8 + .long 0x09B3 + .long 0x09AE + .long 0x09A9 + .long 0x09A4 + .long 0x099E + .long 0x0999 + .long 0x0994 + .long 0x098F + .long 0x098A + .long 0x0985 + .long 0x0980 + .long 0x097B + .long 0x0976 + .long 0x0971 + .long 0x096C + .long 0x0967 + .long 0x0962 + .long 0x095C + .long 0x0957 + .long 0x0952 + .long 0x094D + .long 0x0948 + .long 0x0943 + .long 0x093E + .long 0x0939 + .long 0x0935 + .long 0x0930 + .long 0x092B + .long 0x0926 + .long 0x0921 + .long 0x091C + .long 0x0917 + .long 0x0912 + .long 0x090D + .long 0x0908 + .long 0x0903 + .long 0x08FE + .long 0x08FA + .long 0x08F5 + .long 0x08F0 + .long 0x08EB + .long 0x08E6 + .long 0x08E1 + .long 0x08DC + .long 0x08D8 + .long 0x08D3 + .long 0x08CE + .long 0x08C9 + .long 0x08C4 + .long 0x08C0 + .long 0x08BB + .long 0x08B6 + .long 0x08B1 + .long 0x08AC + .long 0x08A8 + .long 0x08A3 + .long 0x089E + 
.long 0x089A + .long 0x0895 + .long 0x0890 + .long 0x088B + .long 0x0887 + .long 0x0882 + .long 0x087D + .long 0x0879 + .long 0x0874 + .long 0x086F + .long 0x086B + .long 0x0866 + .long 0x0861 + .long 0x085D + .long 0x0858 + .long 0x0853 + .long 0x084F + .long 0x084A + .long 0x0846 + .long 0x0841 + .long 0x083C + .long 0x0838 + .long 0x0833 + .long 0x082F + .long 0x082A + .long 0x0825 + .long 0x0821 + .long 0x081C + .long 0x0818 + .long 0x0813 + .long 0x080F + .long 0x080A + .long 0x0806 + .long 0x0801 + .long 0x07FD + .long 0x07F8 + .long 0x07F4 + .long 0x07EF + .long 0x07EB + .long 0x07E6 + .long 0x07E2 + .long 0x07DD + .long 0x07D9 + .long 0x07D5 + .long 0x07D0 + .long 0x07CC + .long 0x07C7 + .long 0x07C3 + .long 0x07BE + .long 0x07BA + .long 0x07B6 + .long 0x07B1 + .long 0x07AD + .long 0x07A9 + .long 0x07A4 + .long 0x07A0 + .long 0x079B + .long 0x0797 + .long 0x0793 + .long 0x078E + .long 0x078A + .long 0x0786 + .long 0x0781 + .long 0x077D + .long 0x0779 + .long 0x0774 + .long 0x0770 + .long 0x076C + .long 0x0768 + .long 0x0763 + .long 0x075F + .long 0x075B + .long 0x0757 + .long 0x0752 + .long 0x074E + .long 0x074A + .long 0x0746 + .long 0x0741 + .long 0x073D + .long 0x0739 + .long 0x0735 + .long 0x0730 + .long 0x072C + .long 0x0728 + .long 0x0724 + .long 0x0720 + .long 0x071C + .long 0x0717 + .long 0x0713 + .long 0x070F + .long 0x070B + .long 0x0707 + .long 0x0703 + .long 0x06FE + .long 0x06FA + .long 0x06F6 + .long 0x06F2 + .long 0x06EE + .long 0x06EA + .long 0x06E6 + .long 0x06E2 + .long 0x06DE + .long 0x06DA + .long 0x06D5 + .long 0x06D1 + .long 0x06CD + .long 0x06C9 + .long 0x06C5 + .long 0x06C1 + .long 0x06BD + .long 0x06B9 + .long 0x06B5 + .long 0x06B1 + .long 0x06AD + .long 0x06A9 + .long 0x06A5 + .long 0x06A1 + .long 0x069D + .long 0x0699 + .long 0x0695 + .long 0x0691 + .long 0x068D + .long 0x0689 + .long 0x0685 + .long 0x0681 + .long 0x067D + .long 0x0679 + .long 0x0675 + .long 0x0671 + .long 0x066D + .long 0x066A + .long 0x0666 + .long 0x0662 + .long 0x065E + .long 0x065A + .long 0x0656 + .long 0x0652 + .long 0x064E + .long 0x064A + .long 0x0646 + .long 0x0643 + .long 0x063F + .long 0x063B + .long 0x0637 + .long 0x0633 + .long 0x062F + .long 0x062B + .long 0x0628 + .long 0x0624 + .long 0x0620 + .long 0x061C + .long 0x0618 + .long 0x0614 + .long 0x0611 + .long 0x060D + .long 0x0609 + .long 0x0605 + .long 0x0601 + .long 0x05FE + .long 0x05FA + .long 0x05F6 + .long 0x05F2 + .long 0x05EF + .long 0x05EB + .long 0x05E7 + .long 0x05E3 + .long 0x05E0 + .long 0x05DC + .long 0x05D8 + .long 0x05D4 + .long 0x05D1 + .long 0x05CD + .long 0x05C9 + .long 0x05C6 + .long 0x05C2 + .long 0x05BE + .long 0x05BA + .long 0x05B7 + .long 0x05B3 + .long 0x05AF + .long 0x05AC + .long 0x05A8 + .long 0x05A4 + .long 0x05A1 + .long 0x059D + .long 0x0599 + .long 0x0596 + .long 0x0592 + .long 0x058F + .long 0x058B + .long 0x0587 + .long 0x0584 + .long 0x0580 + .long 0x057C + .long 0x0579 + .long 0x0575 + .long 0x0572 + .long 0x056E + .long 0x056B + .long 0x0567 + .long 0x0563 + .long 0x0560 + .long 0x055C + .long 0x0559 + .long 0x0555 + .long 0x0552 + .long 0x054E + .long 0x054A + .long 0x0547 + .long 0x0543 + .long 0x0540 + .long 0x053C + .long 0x0539 + .long 0x0535 + .long 0x0532 + .long 0x052E + .long 0x052B + .long 0x0527 + .long 0x0524 + .long 0x0520 + .long 0x051D + .long 0x0519 + .long 0x0516 + .long 0x0512 + .long 0x050F + .long 0x050B + .long 0x0508 + .long 0x0505 + .long 0x0501 + .long 0x04FE + .long 0x04FA + .long 0x04F7 + .long 0x04F3 + .long 0x04F0 + .long 0x04EC + .long 0x04E9 + .long 0x04E6 + 
.long 0x04E2 + .long 0x04DF + .long 0x04DB + .long 0x04D8 + .long 0x04D5 + .long 0x04D1 + .long 0x04CE + .long 0x04CA + .long 0x04C7 + .long 0x04C4 + .long 0x04C0 + .long 0x04BD + .long 0x04BA + .long 0x04B6 + .long 0x04B3 + .long 0x04B0 + .long 0x04AC + .long 0x04A9 + .long 0x04A6 + .long 0x04A2 + .long 0x049F + .long 0x049C + .long 0x0498 + .long 0x0495 + .long 0x0492 + .long 0x048E + .long 0x048B + .long 0x0488 + .long 0x0484 + .long 0x0481 + .long 0x047E + .long 0x047B + .long 0x0477 + .long 0x0474 + .long 0x0471 + .long 0x046E + .long 0x046A + .long 0x0467 + .long 0x0464 + .long 0x0461 + .long 0x045D + .long 0x045A + .long 0x0457 + .long 0x0454 + .long 0x0450 + .long 0x044D + .long 0x044A + .long 0x0447 + .long 0x0444 + .long 0x0440 + .long 0x043D + .long 0x043A + .long 0x0437 + .long 0x0434 + .long 0x0430 + .long 0x042D + .long 0x042A + .long 0x0427 + .long 0x0424 + .long 0x0420 + .long 0x041D + .long 0x041A + .long 0x0417 + .long 0x0414 + .long 0x0411 + .long 0x040E + .long 0x040A + .long 0x0407 + .long 0x0404 + .long 0x0401 + .long 0x03FE + .long 0x03FB + .long 0x03F8 + .long 0x03F5 + .long 0x03F1 + .long 0x03EE + .long 0x03EB + .long 0x03E8 + .long 0x03E5 + .long 0x03E2 + .long 0x03DF + .long 0x03DC + .long 0x03D9 + .long 0x03D6 + .long 0x03D3 + .long 0x03CF + .long 0x03CC + .long 0x03C9 + .long 0x03C6 + .long 0x03C3 + .long 0x03C0 + .long 0x03BD + .long 0x03BA + .long 0x03B7 + .long 0x03B4 + .long 0x03B1 + .long 0x03AE + .long 0x03AB + .long 0x03A8 + .long 0x03A5 + .long 0x03A2 + .long 0x039F + .long 0x039C + .long 0x0399 + .long 0x0396 + .long 0x0393 + .long 0x0390 + .long 0x038D + .long 0x038A + .long 0x0387 + .long 0x0384 + .long 0x0381 + .long 0x037E + .long 0x037B + .long 0x0378 + .long 0x0375 + .long 0x0372 + .long 0x036F + .long 0x036C + .long 0x0369 + .long 0x0366 + .long 0x0363 + .long 0x0360 + .long 0x035E + .long 0x035B + .long 0x0358 + .long 0x0355 + .long 0x0352 + .long 0x034F + .long 0x034C + .long 0x0349 + .long 0x0346 + .long 0x0343 + .long 0x0340 + .long 0x033E + .long 0x033B + .long 0x0338 + .long 0x0335 + .long 0x0332 + .long 0x032F + .long 0x032C + .long 0x0329 + .long 0x0327 + .long 0x0324 + .long 0x0321 + .long 0x031E + .long 0x031B + .long 0x0318 + .long 0x0315 + .long 0x0313 + .long 0x0310 + .long 0x030D + .long 0x030A + .long 0x0307 + .long 0x0304 + .long 0x0302 + .long 0x02FF + .long 0x02FC + .long 0x02F9 + .long 0x02F6 + .long 0x02F3 + .long 0x02F1 + .long 0x02EE + .long 0x02EB + .long 0x02E8 + .long 0x02E5 + .long 0x02E3 + .long 0x02E0 + .long 0x02DD + .long 0x02DA + .long 0x02D8 + .long 0x02D5 + .long 0x02D2 + .long 0x02CF + .long 0x02CC + .long 0x02CA + .long 0x02C7 + .long 0x02C4 + .long 0x02C1 + .long 0x02BF + .long 0x02BC + .long 0x02B9 + .long 0x02B7 + .long 0x02B4 + .long 0x02B1 + .long 0x02AE + .long 0x02AC + .long 0x02A9 + .long 0x02A6 + .long 0x02A3 + .long 0x02A1 + .long 0x029E + .long 0x029B + .long 0x0299 + .long 0x0296 + .long 0x0293 + .long 0x0291 + .long 0x028E + .long 0x028B + .long 0x0288 + .long 0x0286 + .long 0x0283 + .long 0x0280 + .long 0x027E + .long 0x027B + .long 0x0278 + .long 0x0276 + .long 0x0273 + .long 0x0270 + .long 0x026E + .long 0x026B + .long 0x0268 + .long 0x0266 + .long 0x0263 + .long 0x0261 + .long 0x025E + .long 0x025B + .long 0x0259 + .long 0x0256 + .long 0x0253 + .long 0x0251 + .long 0x024E + .long 0x024C + .long 0x0249 + .long 0x0246 + .long 0x0244 + .long 0x0241 + .long 0x023E + .long 0x023C + .long 0x0239 + .long 0x0237 + .long 0x0234 + .long 0x0232 + .long 0x022F + .long 0x022C + .long 0x022A + .long 0x0227 + 
.long 0x0225 + .long 0x0222 + .long 0x021F + .long 0x021D + .long 0x021A + .long 0x0218 + .long 0x0215 + .long 0x0213 + .long 0x0210 + .long 0x020E + .long 0x020B + .long 0x0208 + .long 0x0206 + .long 0x0203 + .long 0x0201 + .long 0x01FE + .long 0x01FC + .long 0x01F9 + .long 0x01F7 + .long 0x01F4 + .long 0x01F2 + .long 0x01EF + .long 0x01ED + .long 0x01EA + .long 0x01E8 + .long 0x01E5 + .long 0x01E3 + .long 0x01E0 + .long 0x01DE + .long 0x01DB + .long 0x01D9 + .long 0x01D6 + .long 0x01D4 + .long 0x01D1 + .long 0x01CF + .long 0x01CC + .long 0x01CA + .long 0x01C7 + .long 0x01C5 + .long 0x01C2 + .long 0x01C0 + .long 0x01BD + .long 0x01BB + .long 0x01B9 + .long 0x01B6 + .long 0x01B4 + .long 0x01B1 + .long 0x01AF + .long 0x01AC + .long 0x01AA + .long 0x01A7 + .long 0x01A5 + .long 0x01A3 + .long 0x01A0 + .long 0x019E + .long 0x019B + .long 0x0199 + .long 0x0196 + .long 0x0194 + .long 0x0192 + .long 0x018F + .long 0x018D + .long 0x018A + .long 0x0188 + .long 0x0186 + .long 0x0183 + .long 0x0181 + .long 0x017E + .long 0x017C + .long 0x017A + .long 0x0177 + .long 0x0175 + .long 0x0173 + .long 0x0170 + .long 0x016E + .long 0x016B + .long 0x0169 + .long 0x0167 + .long 0x0164 + .long 0x0162 + .long 0x0160 + .long 0x015D + .long 0x015B + .long 0x0159 + .long 0x0156 + .long 0x0154 + .long 0x0151 + .long 0x014F + .long 0x014D + .long 0x014A + .long 0x0148 + .long 0x0146 + .long 0x0143 + .long 0x0141 + .long 0x013F + .long 0x013C + .long 0x013A + .long 0x0138 + .long 0x0136 + .long 0x0133 + .long 0x0131 + .long 0x012F + .long 0x012C + .long 0x012A + .long 0x0128 + .long 0x0125 + .long 0x0123 + .long 0x0121 + .long 0x011F + .long 0x011C + .long 0x011A + .long 0x0118 + .long 0x0115 + .long 0x0113 + .long 0x0111 + .long 0x010F + .long 0x010C + .long 0x010A + .long 0x0108 + .long 0x0105 + .long 0x0103 + .long 0x0101 + .long 0x00FF + .long 0x00FC + .long 0x00FA + .long 0x00F8 + .long 0x00F6 + .long 0x00F3 + .long 0x00F1 + .long 0x00EF + .long 0x00ED + .long 0x00EA + .long 0x00E8 + .long 0x00E6 + .long 0x00E4 + .long 0x00E2 + .long 0x00DF + .long 0x00DD + .long 0x00DB + .long 0x00D9 + .long 0x00D6 + .long 0x00D4 + .long 0x00D2 + .long 0x00D0 + .long 0x00CE + .long 0x00CB + .long 0x00C9 + .long 0x00C7 + .long 0x00C5 + .long 0x00C3 + .long 0x00C0 + .long 0x00BE + .long 0x00BC + .long 0x00BA + .long 0x00B8 + .long 0x00B5 + .long 0x00B3 + .long 0x00B1 + .long 0x00AF + .long 0x00AD + .long 0x00AB + .long 0x00A8 + .long 0x00A6 + .long 0x00A4 + .long 0x00A2 + .long 0x00A0 + .long 0x009E + .long 0x009B + .long 0x0099 + .long 0x0097 + .long 0x0095 + .long 0x0093 + .long 0x0091 + .long 0x008F + .long 0x008C + .long 0x008A + .long 0x0088 + .long 0x0086 + .long 0x0084 + .long 0x0082 + .long 0x0080 + .long 0x007D + .long 0x007B + .long 0x0079 + .long 0x0077 + .long 0x0075 + .long 0x0073 + .long 0x0071 + .long 0x006F + .long 0x006D + .long 0x006A + .long 0x0068 + .long 0x0066 + .long 0x0064 + .long 0x0062 + .long 0x0060 + .long 0x005E + .long 0x005C + .long 0x005A + .long 0x0058 + .long 0x0056 + .long 0x0053 + .long 0x0051 + .long 0x004F + .long 0x004D + .long 0x004B + .long 0x0049 + .long 0x0047 + .long 0x0045 + .long 0x0043 + .long 0x0041 + .long 0x003F + .long 0x003D + .long 0x003B + .long 0x0039 + .long 0x0036 + .long 0x0034 + .long 0x0032 + .long 0x0030 + .long 0x002E + .long 0x002C + .long 0x002A + .long 0x0028 + .long 0x0026 + .long 0x0024 + .long 0x0022 + .long 0x0020 + .long 0x001E + .long 0x001C + .long 0x001A + .long 0x0018 + .long 0x0016 + .long 0x0014 + .long 0x0012 + .long 0x0010 + .long 0x000E + .long 0x000C + 
.long 0x000A + .long 0x0008 + .long 0x0006 + .long 0x0004 + .long 0x0002 +
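For readers following the bit manipulation in __vrd4_frcpa above, here is a scalar C model of what we believe the routine computes (our reconstruction, not part of the library; the names, the exponent rule, and the on-the-fly table formula are inferred from the assembly and spot-checked against several .L__rcp_table entries):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scalar model of the 10-bit table-based reciprocal approximation.
 * Only finite, normal inputs with mid-range exponents are handled,
 * matching the assembly, which performs no special-case checks. */
static double frcpa_model(double x)
{
    uint64_t bits, out;
    memcpy(&bits, &x, sizeof bits);

    uint64_t sign = bits & 0x8000000000000000ull;
    int      exp  = (int)((bits >> 52) & 0x7ff);      /* biased exponent */
    unsigned m11  = (unsigned)((bits >> 41) & 0x7ff); /* top 11 mantissa bits */

    /* round the 11 bits to a 10-bit index ("if 1/2 bit set, increment") */
    unsigned idx = ((m11 + 1) >> 1) & 0x3ff;

    /* 1/(2^e * 1.m) needs biased exponent 2046-exp, or one less once the
     * mantissa is nonzero; this is the "invert the exponent" subtraction */
    int rexp = 2046 - exp - (m11 != 0);

    /* table entry: top 12 bits of 2/(1 + idx/1024) - 1; idx == 0 stands
     * for a mantissa that rounded up to 2.0, whose reciprocal mantissa
     * is exactly 1.0 (entry 0x0000).  This formula reproduces the
     * .L__rcp_table entries we checked (indexes 1, 2, 5, 10, 1023). */
    unsigned tab = idx ? (unsigned)((2.0 / (1.0 + idx / 1024.0) - 1.0)
                                    * 4096.0 + 0.5)
                       : 0;

    out = sign | ((uint64_t)rexp << 52) | ((uint64_t)tab << 40);
    memcpy(&x, &out, sizeof x);
    return x;
}

int main(void)
{
    printf("%.6f ~ %.6f\n", frcpa_model(3.0), 1.0 / 3.0);  /* ~0.333313 */
    return 0;
}

For x = 3.0 the model returns about 0.333313 versus 0.333333, roughly 12 significant bits, which is consistent with the 12-bit table entries.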
diff --git a/src/gas/vrd4log.S b/src/gas/vrd4log.S new file mode 100644 index 0000000..1e2b1e4 --- /dev/null +++ b/src/gas/vrd4log.S
@@ -0,0 +1,855 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv.  If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log.asm
+#
+# A vector implementation of the log libm function.
+#
+# Prototype:
+#
+#    __m128d,__m128d __vrd4_log(__m128d x1, __m128d x2);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.  This version can compute 4 logs in
+# 192 cycles, or 48 per value.
+#
+# This routine computes 4 double precision log values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C standard) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops.  Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory.  This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ	p_x,0			# temporary for error checking operation
+.equ	p_idx,0x010		# index storage
+.equ	p_xexp,0x020		# exponent storage
+
+.equ	p_x2,0x030		# temporary for error checking operation
+.equ	p_idx2,0x040		# index storage
+.equ	p_xexp2,0x050		# exponent storage
+
+.equ	save_xa,0x060		#qword
+.equ	save_ya,0x068		#qword
+.equ	save_nv,0x070		#qword
+.equ	p_iter,0x078		# qword storage for number of loop iterations
+
+.equ	save_rbx,0x080		#qword
+
+
+.equ	p2_temp,0x090		# second temporary for get/put bits operation
+.equ	p2_temp1,0x0b0		# second temporary for exponent multiply
+
+.equ	p_n1,0x0c0		# temporary for near one check
+.equ	p_n12,0x0d0		# temporary for near one check
+
+
+.equ	stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+	.text
+	.align 16
+	.p2align 4,,15
+.globl __vrd4_log
+	.type	__vrd4_log,@function
+__vrd4_log:
+	sub	$stack_size,%rsp
+	mov	%rbx,save_rbx(%rsp)	# save rbx
+
+# process 4 values at a time.
+ + movdqa %xmm1,p_x2(%rsp) # save the input values + movdqa %xmm0,p_x(%rsp) # save the input values +# compute the logs + +## if NaN or inf + +# /* Store the exponent of x in xexp and put +# f into the range [0.5,1) */ + + pxor %xmm1,%xmm1 + movdqa %xmm0,%xmm3 + psrlq $52,%xmm3 + psubq .L__mask_1023(%rip),%xmm3 + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm6 # xexp + movdqa %xmm0,%xmm2 + subpd .L__real_one(%rip),%xmm2 + + movapd %xmm6,p_xexp(%rsp) + andpd .L__real_notsign(%rip),%xmm2 + xor %rax,%rax + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + + cmppd $1,.L__real_threshold(%rip),%xmm2 + movmskpd %xmm2,%ecx + movdqa %xmm3,%xmm4 + mov %ecx,p_n1(%rsp) + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrlq $45,%xmm3 + movdqa %xmm3,%xmm2 + psrlq $1,%xmm3 + paddq .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm2 + paddq %xmm2,%xmm3 + + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm1 + pxor %xmm7,%xmm7 + movdqa p_x2(%rsp),%xmm2 + movapd p_x2(%rsp),%xmm5 + psrlq $52,%xmm2 + psubq .L__mask_1023(%rip),%xmm2 + packssdw %xmm7,%xmm2 + subpd .L__real_one(%rip),%xmm5 + andpd .L__real_notsign(%rip),%xmm5 + cvtdq2pd %xmm2,%xmm6 # xexp + xor %rcx,%rcx + cmppd $1,.L__real_threshold(%rip),%xmm5 + movq %xmm3,p_idx(%rsp) + +# reduce and get u + por .L__real_half(%rip),%xmm4 + movdqa %xmm4,%xmm2 + movapd %xmm6,p_xexp2(%rsp) + + # do near one check + movmskpd %xmm5,%edx + mov %edx,p_n12(%rsp) + + mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128 + + + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx(%rsp),%eax + movdqa p_x2(%rsp),%xmm6 + + movapd .L__real_half(%rip),%xmm5 # .5 + subpd %xmm1,%xmm2 # f2 = f - f1 + pand .L__real_mant(%rip),%xmm6 + mulpd %xmm2,%xmm5 + addpd %xmm5,%xmm1 + + movdqa %xmm6,%xmm8 + psrlq $45,%xmm6 + movdqa %xmm6,%xmm4 + + psrlq $1,%xmm6 + paddq .L__mask_040(%rip),%xmm6 + pand .L__mask_001(%rip),%xmm4 + paddq %xmm4,%xmm6 +# do error checking here for scheduling. Saves a bunch of cycles as +# compared to doing this at the start of the routine. +## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + packssdw %xmm7,%xmm6 + por .L__real_half(%rip),%xmm8 + movq %xmm6,p_idx2(%rsp) + cvtdq2pd %xmm6,%xmm9 + + cmppd $2,.L__real_zero(%rip),%xmm0 + mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128 + movmskpd %xmm0,%r9d +# delaying this divide helps, but moving the other one does not. 
+# it was after the paddq + divpd %xmm1,%xmm2 # u + +# compute the index into the log tables +# + + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + mulpd %xmm3,%xmm2 # u5(B+Cu2) + + movapd p_xexp(%rsp),%xmm5 # xexp + addpd %xmm2,%xmm1 # poly +# recombine + mulpd %xmm5,%xmm4 # xexp * log2_lead + addpd %xmm4,%xmm0 #r1 + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%eax + mov p_idx2+4(%rsp),%ecx + addpd %xmm4,%xmm1 + + mulpd .L__real_log2_tail(%rip),%xmm5 + + movapd .L__real_half(%rip),%xmm4 # .5 + subpd %xmm9,%xmm8 # f2 = f - f1 + mulpd %xmm8,%xmm4 + addpd %xmm4,%xmm9 + + addpd %xmm5,%xmm1 #r2 + divpd %xmm9,%xmm8 # u + movapd p_x2(%rsp),%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r10d + movapd p_x2(%rsp),%xmm6 + cmppd $2,.L__real_zero(%rip),%xmm6 + movmskpd %xmm6,%r11d + +# check for nans/infs + test $3,%r8d + addpd %xmm1,%xmm0 + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + + +.L__vlog2: + + + # It seems like a good idea to try and interleave + # even more of the following code sooner into the + # program. But there were conflicts with the table + # index registers, making the problem difficult. + # After a lot of work in a branch of this file, + # I was not able to match the speed of this version. + # CodeAnalyst shows that there is lots of unused add + # pipe time around the divides, but the processor + # doesn't seem to be able to schedule in those slots. 
+ + movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q + +# check for near one + mov p_n1(%rsp),%r9d + test $3,%r9d + jnz .L__near_one1 +.L__vlog2n: + + + # solve for ln(1+u) + movapd %xmm8,%xmm9 # u + mulpd %xmm8,%xmm8 # u^2 + movapd %xmm8,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm8,%xmm3 #Cu2 + mulpd %xmm9,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm8 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm9 # u+Au3 + mulpd %xmm3,%xmm8 # u5(B+Cu2) + + movapd p_xexp2(%rsp),%xmm5 # xexp + addpd %xmm8,%xmm9 # poly + # recombine + mulpd %xmm5,%xmm4 + addpd %xmm4,%xmm7 #r1 + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + addpd %xmm2,%xmm9 + + mulpd .L__real_log2_tail(%rip),%xmm5 + + addpd %xmm5,%xmm9 #r2 + + # check for nans/infs + test $3,%r10d + addpd %xmm9,%xmm7 + jnz .L__log_naninf2 +.L__vlog3: +# check for negative numbers or zero + test $3,%r11d + jnz .L__z_or_n2 + +.L__vlog4: + + + mov p_n12(%rsp),%r9d + test $3,%r9d + jnz .L__near_one2 + +.L__vlog4n: + +# store the result _m128d + movapd %xmm7,%xmm1 + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + .align 16 +.Lboth_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# return r + r2; + addpd %xmm2,%xmm0 + ret + + .align 16 +.L__near_one1: + cmp $3,%r9d + jnz .L__n1nb1 + + movapd p_x(%rsp),%xmm0 + call .Lboth_nearone + jmp .L__vlog2n + + .align 16 +.L__n1nb1: + test $1,%r9d + jz .L__lnn12 + + movlpd p_x(%rsp),%xmm0 + call .L__ln1 + +.L__lnn12: + test $2,%r9d # second number? + jz .L__lnn1e + movlpd %xmm0,p_x(%rsp) + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x+8(%rsp) + movapd p_x(%rsp),%xmm0 + +.L__lnn1e: + jmp .L__vlog2n + + + .align 16 +.L__near_one2: + cmp $3,%r9d + jnz .L__n1nb2 + + movapd %xmm0,%xmm8 + movapd p_x2(%rsp),%xmm0 + call .Lboth_nearone + movapd %xmm0,%xmm7 + movapd %xmm8,%xmm0 + jmp .L__vlog4n + + .align 16 +.L__n1nb2: + movapd %xmm0,%xmm8 + test $1,%r9d + jz .L__lnn22 + + movapd %xmm7,%xmm0 + movlpd p_x2(%rsp),%xmm0 + call .L__ln1 + movapd %xmm0,%xmm7 + +.L__lnn22: + test $2,%r9d # second number? 
+ jz .L__lnn2e + movlpd %xmm7,p_x2(%rsp) + movlpd p_x2+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x2+8(%rsp) + movapd p_x2(%rsp),%xmm7 + +.L__lnn2e: + movapd %xmm8,%xmm0 + jmp .L__vlog4n + + .align 16 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + +# return r + r2; + addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe: + jmp .L__vlog1 # continue processing if not + +# at least one of the numbers was a nan or infinity +.L__log_naninf2: + movapd %xmm0,%xmm2 + test $1,%r10d # first number? + jz .L__lninf22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm7,%xmm1 # save the inputs + mov p_x2(%rsp),%rdx + movlpd p_x2(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm7,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + movapd %xmm0,%xmm7 + +.L__lninf22: + test $2,%r10d # second number? + jz .L__lninfe2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rdx + movlpd p_x2+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe2: + movapd %xmm2,%xmm0 + jmp .L__vlog3 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn2: + test $2,%r9d # second number? 
+ jz .L__zne + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne: + jmp .L__vlog2 + +.L__z_or_n2: + movapd %xmm0,%xmm2 + test $1,%r11d # first number? + jz .L__zn22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2(%rsp),%rax + call .L__zni + shufpd $2,%xmm7,%xmm0 + movapd %xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn22: + test $2,%r11d # second number? + jz .L__zne2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne2: + movapd %xmm2,%xmm0 + jmp .L__vlog4 +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x ## if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + + + .data + .align 16 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 # for alignment +.L__real_two: .quad 0x04000000000000000 # 1.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03F9EB85000000000 # .03 + .quad 0x03F9EB85000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 0x00000000000000001 # + .quad 0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 + + .align 16 + +.L__np_ln_lead_table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 
0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01 + .quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01 + .quad 0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01 + .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + 
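The trailing-zero pattern in the lead table just above is consistent with truncating ln(1 + j/64) to an integral multiple of 2^-24, which leaves a remainder below 2^-24 (about 6e-8); that matches the magnitudes in the tail table that follows. A sketch of this assumed generation rule (illustrative only; the shipped tail entries were evidently computed from a higher-precision logarithm, so plain double log() only approximates them):

#include <math.h>
#include <stdio.h>

/* Assumed rule for .L__np_ln_lead_table / .L__np_ln_tail_table:
 * lead = ln(1 + j/64) truncated to a multiple of 2^-24, tail = remainder. */
static void split_ln(int j, double *lead, double *tail)
{
    double v = log(1.0 + j / 64.0);
    *lead = floor(v * 0x1p24) * 0x1p-24; /* keep bits down to 2^-24 */
    *tail = v - *lead;                   /* < 2^-24, low-order correction */
}

int main(void)
{
    double lead, tail;
    split_ln(1, &lead, &tail);
    /* expect roughly 1.5504181385e-02 and 5.15e-09, as in the tables */
    printf("lead=%.17g tail=%.3g\n", lead, tail);
    return 0;
}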
+.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08 + .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08 + .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08 + .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 
4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment +
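With both tables in view, the main path of __vrd4_log (finite, positive, not near 1) can be summarized as a scalar C model. This is our reconstruction for exposition only: the routine above does four of these at once in SSE2 registers, and the two table lookups are collapsed here into a single log(2*f1) call, which costs a few ulps relative to the lead/tail arithmetic of the shipped code.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scalar model of __vrd4_log's main path: split x = 2^xexp * (2f) with
 * f in [0.5,1), round f to f1 = index/128, then
 * ln(x) = xexp*ln2 + ln(2*f1) + ln(f/f1).  Special cases omitted. */
static double vrd4_log_model(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    int      xexp = (int)((bits >> 52) & 0x7ff) - 1023;
    uint64_t mant = bits & 0x000fffffffffffffull;

    /* index = 64 + ceil(t/2), t = top 7 mantissa bits (the psrlq $45,
     * psrlq $1, paddq 0x40/0x001 sequence above) */
    unsigned t     = (unsigned)(mant >> 45);
    unsigned index = 64 + (t >> 1) + (t & 1);
    double   f1    = index * 0.0078125;              /* index/128 */

    /* f = mantissa rescaled into [0.5,1), as mant | .L__real_half */
    uint64_t fbits = mant | 0x3fe0000000000000ull;
    double f;
    memcpy(&f, &fbits, sizeof f);

    double f2 = f - f1;
    double u  = f2 / (f1 + 0.5 * f2);                /* = 2(f-f1)/(f+f1) */

    /* ln(f/f1) = u + u^3/12 + u^5/80 + u^7/448; the cb coefficients */
    double u2   = u * u;
    double poly = u + u * u2 * (8.33333333333333593622e-02
                       + u2 * (1.24999999978138668903e-02
                       + u2 *  2.23219810758559851206e-03));

    /* the lead/tail table entries hold ln(2*f1) = ln(index/64); modeled
     * here with one log() call instead of the two tables */
    double z  = log(2.0 * f1);
    double r1 = z + xexp * 6.93147122859954833984e-01;    /* log2_lead */
    double r2 = poly + xexp * 5.76999904754328540596e-08; /* log2_tail */
    return r1 + r2;
}

int main(void)
{
    printf("%.17g vs %.17g\n", vrd4_log_model(10.0), log(10.0));
    return 0;
}

For x = 10 the model should agree with log() to within a few ulps; vrd4log10.S below follows the same structure and converts the result to base 10 with a lead/tail-split multiply by log10(e).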
diff --git a/src/gas/vrd4log10.S b/src/gas/vrd4log10.S new file mode 100644 index 0000000..d0f861c --- /dev/null +++ b/src/gas/vrd4log10.S
@@ -0,0 +1,924 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv.  If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log10.asm
+#
+# A vector implementation of the log10 libm function.
+#
+# Prototype:
+#
+#    __m128d,__m128d __vrd4_log10(__m128d x1, __m128d x2);
+#
+# Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.  This version can compute 4 log10s in
+# 220 cycles, or 55 per value.
+#
+# This routine computes 4 double precision log10 values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C standard) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops.  Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory.  This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ	p_x,0			# temporary for error checking operation
+.equ	p_idx,0x010		# index storage
+.equ	p_xexp,0x020		# exponent storage
+
+.equ	p_x2,0x030		# temporary for error checking operation
+.equ	p_idx2,0x040		# index storage
+.equ	p_xexp2,0x050		# exponent storage
+
+.equ	save_xa,0x060		#qword
+.equ	save_ya,0x068		#qword
+.equ	save_nv,0x070		#qword
+.equ	p_iter,0x078		# qword storage for number of loop iterations
+
+.equ	save_rbx,0x080		#qword
+
+
+.equ	p2_temp,0x090		# second temporary for get/put bits operation
+.equ	p2_temp1,0x0b0		# second temporary for exponent multiply
+
+.equ	p_n1,0x0c0		# temporary for near one check
+.equ	p_n12,0x0d0		# temporary for near one check
+
+
+.equ	stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+	.text
+	.align 16
+	.p2align 4,,15
+.globl __vrd4_log10
+	.type	__vrd4_log10,@function
+__vrd4_log10:
+	sub	$stack_size,%rsp
+	mov	%rbx,save_rbx(%rsp)	# save rbx
+
+# process 4 values at a time.
+ + movdqa %xmm1,p_x2(%rsp) # save the input values + movdqa %xmm0,p_x(%rsp) # save the input values +# compute the log10s + +## if NaN or inf + +# /* Store the exponent of x in xexp and put +# f into the range [0.5,1) */ + + pxor %xmm1,%xmm1 + movdqa %xmm0,%xmm3 + psrlq $52,%xmm3 + psubq .L__mask_1023(%rip),%xmm3 + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm6 # xexp + movdqa %xmm0,%xmm2 + subpd .L__real_one(%rip),%xmm2 + + movapd %xmm6,p_xexp(%rsp) + andpd .L__real_notsign(%rip),%xmm2 + xor %rax,%rax + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + + cmppd $1,.L__real_threshold(%rip),%xmm2 + movmskpd %xmm2,%ecx + movdqa %xmm3,%xmm4 + mov %ecx,p_n1(%rsp) + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrlq $45,%xmm3 + movdqa %xmm3,%xmm2 + psrlq $1,%xmm3 + paddq .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm2 + paddq %xmm2,%xmm3 + + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm1 + pxor %xmm7,%xmm7 + movdqa p_x2(%rsp),%xmm2 + movapd p_x2(%rsp),%xmm5 + psrlq $52,%xmm2 + psubq .L__mask_1023(%rip),%xmm2 + packssdw %xmm7,%xmm2 + subpd .L__real_one(%rip),%xmm5 + andpd .L__real_notsign(%rip),%xmm5 + cvtdq2pd %xmm2,%xmm6 # xexp + xor %rcx,%rcx + cmppd $1,.L__real_threshold(%rip),%xmm5 + movq %xmm3,p_idx(%rsp) + +# reduce and get u + por .L__real_half(%rip),%xmm4 + movdqa %xmm4,%xmm2 + movapd %xmm6,p_xexp2(%rsp) + + # do near one check + movmskpd %xmm5,%edx + mov %edx,p_n12(%rsp) + + mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128 + + + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx(%rsp),%eax + movdqa p_x2(%rsp),%xmm6 + + movapd .L__real_half(%rip),%xmm5 # .5 + subpd %xmm1,%xmm2 # f2 = f - f1 + pand .L__real_mant(%rip),%xmm6 + mulpd %xmm2,%xmm5 + addpd %xmm5,%xmm1 + + movdqa %xmm6,%xmm8 + psrlq $45,%xmm6 + movdqa %xmm6,%xmm4 + + psrlq $1,%xmm6 + paddq .L__mask_040(%rip),%xmm6 + pand .L__mask_001(%rip),%xmm4 + paddq %xmm4,%xmm6 +# do error checking here for scheduling. Saves a bunch of cycles as +# compared to doing this at the start of the routine. +## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + packssdw %xmm7,%xmm6 + por .L__real_half(%rip),%xmm8 + movq %xmm6,p_idx2(%rsp) + cvtdq2pd %xmm6,%xmm9 + + cmppd $2,.L__real_zero(%rip),%xmm0 + mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128 + movmskpd %xmm0,%r9d +# delaying this divide helps, but moving the other one does not. 
+# it was after the paddq + divpd %xmm1,%xmm2 # u + +# compute the index into the log10 tables +# + + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + mulpd %xmm3,%xmm2 # u5(B+Cu2) + + movapd p_xexp(%rsp),%xmm5 # xexp + addpd %xmm2,%xmm1 # poly +# recombine + mulpd %xmm5,%xmm4 # xexp * log2_lead + addpd %xmm4,%xmm0 #r1 + movapd %xmm0,%xmm2 #for log10 + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q + mulpd .L__real_log10e_tail(%rip),%xmm0 #for log10 + mulpd .L__real_log10e_lead(%rip),%xmm2 #for log10 + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%eax + mov p_idx2+4(%rsp),%ecx + addpd %xmm4,%xmm1 + + + + mulpd .L__real_log2_tail(%rip),%xmm5 + + movapd .L__real_half(%rip),%xmm4 # .5 + + + subpd %xmm9,%xmm8 # f2 = f - f1 + mulpd %xmm8,%xmm4 + addpd %xmm4,%xmm9 + + addpd %xmm5,%xmm1 #r2 + movapd %xmm1,%xmm7 #for log10 + mulpd .L__real_log10e_tail(%rip),%xmm1 #for log10 + addpd %xmm1,%xmm0 #for log10 + + + divpd %xmm9,%xmm8 # u + movapd p_x2(%rsp),%xmm3 + mulpd .L__real_log10e_lead(%rip),%xmm7 #log10 + andpd .L__real_inf(%rip),%xmm3 + + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r10d + addpd %xmm7,%xmm0 #for log10 + movapd p_x2(%rsp),%xmm6 + cmppd $2,.L__real_zero(%rip),%xmm6 + movmskpd %xmm6,%r11d + + + + +# check for nans/infs + test $3,%r8d + addpd %xmm2,%xmm0 #for log10 +# addpd %xmm1,%xmm0 + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + + +.L__vlog2: + + + # It seems like a good idea to try and interleave + # even more of the following code sooner into the + # program. But there were conflicts with the table + # index registers, making the problem difficult. + # After a lot of work in a branch of this file, + # I was not able to match the speed of this version. + # CodeAnalyst shows that there is lots of unused add + # pipe time around the divides, but the processor + # doesn't seem to be able to schedule in those slots. 
+ + movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q + +# check for near one + mov p_n1(%rsp),%r9d + test $3,%r9d + jnz .L__near_one1 +.L__vlog2n: + + + # solve for ln(1+u) + movapd %xmm8,%xmm9 # u + mulpd %xmm8,%xmm8 # u^2 + movapd %xmm8,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm8,%xmm3 #Cu2 + mulpd %xmm9,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm8 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm9 # u+Au3 + mulpd %xmm3,%xmm8 # u5(B+Cu2) + + movapd p_xexp2(%rsp),%xmm5 # xexp + addpd %xmm8,%xmm9 # poly + # recombine + mulpd %xmm5,%xmm4 + addpd %xmm4,%xmm7 #r1 + movapd %xmm7,%xmm6 #for log10 + + lea .L__np_ln_tail_table(%rip),%rdx + mulpd .L__real_log10e_tail(%rip),%xmm7 #for log10 + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + mulpd .L__real_log10e_lead(%rip),%xmm6 #for log10 + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + addpd %xmm2,%xmm9 + + mulpd .L__real_log2_tail(%rip),%xmm5 + + addpd %xmm5,%xmm9 #r2 + movapd %xmm9,%xmm8 #for log10 + mulpd .L__real_log10e_tail(%rip),%xmm9 #for log 10 + addpd %xmm9,%xmm7 #for log10 + mulpd .L__real_log10e_lead(%rip),%xmm8 #for log10 + addpd %xmm8,%xmm7 #for log10 + + # check for nans/infs + test $3,%r10d + addpd %xmm6,%xmm7 #for log10 +# addpd %xmm9,%xmm7 + jnz .L__log_naninf2 +.L__vlog3: +# check for negative numbers or zero + test $3,%r11d + jnz .L__z_or_n2 + +.L__vlog4: + + + mov p_n12(%rsp),%r9d + test $3,%r9d + jnz .L__near_one2 + +.L__vlog4n: + +# store the result _m128d + movapd %xmm7,%xmm1 + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + .align 16 +.Lboth_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# loge to log10 + movapd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subpd %xmm3,%xmm0 + addpd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movapd %xmm3,%xmm0 + movapd %xmm2,%xmm1 + + mulpd .L__real_log10e_tail(%rip),%xmm2 + mulpd .L__real_log10e_tail(%rip),%xmm0 + mulpd .L__real_log10e_lead(%rip),%xmm1 + mulpd .L__real_log10e_lead(%rip),%xmm3 + addpd %xmm2,%xmm0 + addpd %xmm1,%xmm0 + addpd %xmm3,%xmm0 +# return r + r2; +# addpd %xmm2,%xmm0 + ret + + .align 16 +.L__near_one1: + cmp $3,%r9d + jnz .L__n1nb1 + + movapd p_x(%rsp),%xmm0 + call .Lboth_nearone + jmp .L__vlog2n + + .align 16 +.L__n1nb1: + test $1,%r9d + jz .L__lnn12 + + movlpd p_x(%rsp),%xmm0 + call .L__ln1 + +.L__lnn12: + test $2,%r9d # second number? 
+ jz .L__lnn1e + movlpd %xmm0,p_x(%rsp) + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x+8(%rsp) + movapd p_x(%rsp),%xmm0 + +.L__lnn1e: + jmp .L__vlog2n + + + .align 16 +.L__near_one2: + cmp $3,%r9d + jnz .L__n1nb2 + + movapd %xmm0,%xmm8 + movapd p_x2(%rsp),%xmm0 + call .Lboth_nearone + movapd %xmm0,%xmm7 + movapd %xmm8,%xmm0 + jmp .L__vlog4n + + .align 16 +.L__n1nb2: + movapd %xmm0,%xmm8 + test $1,%r9d + jz .L__lnn22 + + movapd %xmm7,%xmm0 + movlpd p_x2(%rsp),%xmm0 + call .L__ln1 + movapd %xmm0,%xmm7 + +.L__lnn22: + test $2,%r9d # second number? + jz .L__lnn2e + movlpd %xmm7,p_x2(%rsp) + movlpd p_x2+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x2+8(%rsp) + movapd p_x2(%rsp),%xmm7 + +.L__lnn2e: + movapd %xmm8,%xmm0 + jmp .L__vlog4n + + .align 16 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + +# loge to log10 + movsd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subsd %xmm3,%xmm0 + addsd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movsd %xmm3,%xmm0 + movsd %xmm2,%xmm1 + + mulsd .L__real_log10e_tail(%rip),%xmm2 + mulsd .L__real_log10e_tail(%rip),%xmm0 + mulsd .L__real_log10e_lead(%rip),%xmm1 + mulsd .L__real_log10e_lead(%rip),%xmm3 + addsd %xmm2,%xmm0 + addsd %xmm1,%xmm0 + addsd %xmm3,%xmm0 + +# return r + r2; +# addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe: + jmp .L__vlog1 # continue processing if not + +# at least one of the numbers was a nan or infinity +.L__log_naninf2: + movapd %xmm0,%xmm2 + test $1,%r10d # first number? + jz .L__lninf22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm7,%xmm1 # save the inputs + mov p_x2(%rsp),%rdx + movlpd p_x2(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm7,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + movapd %xmm0,%xmm7 + +.L__lninf22: + test $2,%r10d # second number? 
+ jz .L__lninfe2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rdx + movlpd p_x2+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe2: + movapd %xmm2,%xmm0 + jmp .L__vlog3 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn2: + test $2,%r9d # second number? + jz .L__zne + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne: + jmp .L__vlog2 + +.L__z_or_n2: + movapd %xmm0,%xmm2 + test $1,%r11d # first number? + jz .L__zn22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2(%rsp),%rax + call .L__zni + shufpd $2,%xmm7,%xmm0 + movapd %xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn22: + test $2,%r11d # second number? + jz .L__zne2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne2: + movapd %xmm2,%xmm0 + jmp .L__vlog4 +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x ## if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + + + .data + .align 16 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 # for alignment +.L__real_two: .quad 0x04000000000000000 # 1.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold + .quad 0x03FB082C000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 
0x00000000000000001 # + .quad 0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 + +.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01 + .quad 0x03fdbcb7800000000 +.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7 + .quad 0x03ea8a93728719535 + +.L__mask_lower: .quad 0x0ffffffff00000000 + .quad 0x0ffffffff00000000 + .align 16 + +.L__np_ln_lead_table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01 + .quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 
4.66089725494384765625e-01 + .quad 0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01 + .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08 + .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 
0x03e5da71b7bf7861d # 2.76162876992552906035e-08 + .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08 + .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment +
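+# The two tables above work as a pair: entry k of .L__np_ln_lead_table
+# holds ln(1 + k/64) rounded to a short head (roughly its top 25-30
+# significant bits), and entry k of .L__np_ln_tail_table holds the
+# remainder, so lead[k] + tail[k] recovers ln(1 + k/64) to well beyond
+# double precision while keeping products with the lead part nearly
+# exact. Roughly, in C (lead/tail are stand-in names for the tables):
+#
+#   /* ln(2*f) for f = f1 * (1 + small), f1 = (64 + k)/128:        */
+#   double ln_2f = lead[k] + tail[k]  /* = ln(2*f1) = ln(1 + k/64) */
+#                + ln_f_over_f1;      /* small polynomial part     */
+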
diff --git a/src/gas/vrd4log2.S b/src/gas/vrd4log2.S new file mode 100644 index 0000000..bc254cf --- /dev/null +++ b/src/gas/vrd4log2.S
@@ -0,0 +1,908 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log2.s
+#
+# A vector implementation of the log2 libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_log2(__m128d x1, __m128d x2);
+#
+# Computes the base-2 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute 4 logs in
+# 192 cycles, or 48 per value.
+#
+# This routine computes 4 double precision log2 values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # exponent storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # exponent storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_log2
+ .type __vrd4_log2,@function
+__vrd4_log2:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# process 4 values at a time.
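+
+# Overview of the computation, as a rough scalar C model of one lane
+# (illustrative only: ln_lead/ln_tail stand for the two tables at the
+# end of this file, and CB1..CB3/LOG2E mirror .L__real_cb1..cb3 and
+# the combined .L__real_log2e_lead/_tail; the real code vectorizes
+# this, splits the constants into lead/tail pieces, and branches to
+# fix-up paths for special and near-one inputs):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#   #define CB1 8.33333333333333593622e-02 /* .L__real_cb1 */
+#   #define CB2 1.24999999978138668903e-02 /* .L__real_cb2 */
+#   #define CB3 2.23219810758559851206e-03 /* .L__real_cb3 */
+#   #define LOG2E 1.4426950408889634       /* log2e lead+tail */
+#   extern const double ln_lead[65], ln_tail[65];
+#
+#   double log2_lane(double x) /* x positive, finite, not near 1 */
+#   {
+#       uint64_t ux;
+#       memcpy(&ux, &x, sizeof ux);
+#       int xexp = (int)(ux >> 52) - 1023;     /* unbiased exponent   */
+#       int t = (int)((ux >> 45) & 0x7f);      /* top 7 mantissa bits */
+#       int k = (t >> 1) + (t & 1);            /* rounded table index */
+#       uint64_t uf = (ux & 0x000fffffffffffffULL) | 0x3fe0000000000000ULL;
+#       double f;
+#       memcpy(&f, &uf, sizeof f);             /* f in [0.5, 1)       */
+#       double f1 = (k + 64) / 128.0;          /* breakpoint near f   */
+#       double f2 = f - f1;
+#       double u = f2 / (f1 + 0.5 * f2);       /* u = 2(f-f1)/(f+f1)  */
+#       double v = u * u;
+#       double poly = u + u * v * (CB1 + v * (CB2 + v * CB3));
+#       double z = ln_lead[k] + ln_tail[k] + poly; /* ln(2*f1)+ln(f/f1) */
+#       return xexp + z * LOG2E;
+#   }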
+ + movdqa %xmm1,p_x2(%rsp) # save the input values + movdqa %xmm0,p_x(%rsp) # save the input values +# compute the logs + +## if NaN or inf + +# /* Store the exponent of x in xexp and put +# f into the range [0.5,1) */ + + pxor %xmm1,%xmm1 + movdqa %xmm0,%xmm3 + psrlq $52,%xmm3 + psubq .L__mask_1023(%rip),%xmm3 + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm6 # xexp + movdqa %xmm0,%xmm2 + subpd .L__real_one(%rip),%xmm2 + + movapd %xmm6,p_xexp(%rsp) + andpd .L__real_notsign(%rip),%xmm2 + xor %rax,%rax + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + + cmppd $1,.L__real_threshold(%rip),%xmm2 + movmskpd %xmm2,%ecx + movdqa %xmm3,%xmm4 + mov %ecx,p_n1(%rsp) + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrlq $45,%xmm3 + movdqa %xmm3,%xmm2 + psrlq $1,%xmm3 + paddq .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm2 + paddq %xmm2,%xmm3 + + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm1 + pxor %xmm7,%xmm7 + movdqa p_x2(%rsp),%xmm2 + movapd p_x2(%rsp),%xmm5 + psrlq $52,%xmm2 + psubq .L__mask_1023(%rip),%xmm2 + packssdw %xmm7,%xmm2 + subpd .L__real_one(%rip),%xmm5 + andpd .L__real_notsign(%rip),%xmm5 + cvtdq2pd %xmm2,%xmm6 # xexp + xor %rcx,%rcx + cmppd $1,.L__real_threshold(%rip),%xmm5 + movq %xmm3,p_idx(%rsp) + +# reduce and get u + por .L__real_half(%rip),%xmm4 + movdqa %xmm4,%xmm2 + movapd %xmm6,p_xexp2(%rsp) + + # do near one check + movmskpd %xmm5,%edx + mov %edx,p_n12(%rsp) + + mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128 + + + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx(%rsp),%eax + movdqa p_x2(%rsp),%xmm6 + + movapd .L__real_half(%rip),%xmm5 # .5 + subpd %xmm1,%xmm2 # f2 = f - f1 + pand .L__real_mant(%rip),%xmm6 + mulpd %xmm2,%xmm5 + addpd %xmm5,%xmm1 + + movdqa %xmm6,%xmm8 + psrlq $45,%xmm6 + movdqa %xmm6,%xmm4 + + psrlq $1,%xmm6 + paddq .L__mask_040(%rip),%xmm6 + pand .L__mask_001(%rip),%xmm4 + paddq %xmm4,%xmm6 +# do error checking here for scheduling. Saves a bunch of cycles as +# compared to doing this at the start of the routine. +## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + packssdw %xmm7,%xmm6 + por .L__real_half(%rip),%xmm8 + movq %xmm6,p_idx2(%rsp) + cvtdq2pd %xmm6,%xmm9 + + cmppd $2,.L__real_zero(%rip),%xmm0 + mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128 + movmskpd %xmm0,%r9d +# delaying this divide helps, but moving the other one does not. 
+# it was after the paddq + divpd %xmm1,%xmm2 # u + +# compute the index into the log tables +# + + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2e_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + movapd %xmm0,%xmm5 #z1 copy + mulpd %xmm3,%xmm2 # u5(B+Cu2) + movapd .L__real_log2e_tail(%rip),%xmm3 + + movapd p_xexp(%rsp),%xmm6 # xexp + addpd %xmm2,%xmm1 # poly +# recombine + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%eax + mov p_idx2+4(%rsp),%ecx + addpd %xmm2,%xmm1 #z2 + movapd %xmm1,%xmm2 #z2 copy + + + mulpd %xmm4,%xmm5 + mulpd %xmm4,%xmm1 + movapd .L__real_half(%rip),%xmm4 # .5 + subpd %xmm9,%xmm8 # f2 = f - f1 + mulpd %xmm8,%xmm4 + addpd %xmm4,%xmm9 + mulpd %xmm3,%xmm2 #z2*log2e_tail + mulpd %xmm3,%xmm0 #z1*log2e_tail + addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp + addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail + addpd %xmm1,%xmm0 #r2 + + + divpd %xmm9,%xmm8 # u + movapd p_x2(%rsp),%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r10d + movapd p_x2(%rsp),%xmm6 + cmppd $2,.L__real_zero(%rip),%xmm6 + movmskpd %xmm6,%r11d + +# check for nans/infs + test $3,%r8d + addpd %xmm5,%xmm0 #r1+r2 + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + + +.L__vlog2: + + + # It seems like a good idea to try and interleave + # even more of the following code sooner into the + # program. But there were conflicts with the table + # index registers, making the problem difficult. + # After a lot of work in a branch of this file, + # I was not able to match the speed of this version. + # CodeAnalyst shows that there is lots of unused add + # pipe time around the divides, but the processor + # doesn't seem to be able to schedule in those slots. 
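+
+# The lead/tail multiplies above form an error-compensated product;
+# per lane the recombination is roughly (z1 = table lead value, z2 =
+# table tail + polynomial, LOG2E_LEAD/_TAIL mirroring the constants):
+#
+#   double r1 = z1 * LOG2E_LEAD + xexp;           /* dominant part  */
+#   double r2 = (z1 + z2) * LOG2E_TAIL + z2 * LOG2E_LEAD;
+#   double result = r1 + r2;    /* add the small pieces first       */
+#
+# log2e_lead carries only a few mantissa bits, so z1 * LOG2E_LEAD
+# rounds very little and the tail terms restore the discarded bits.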
+ + movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q + +# check for near one + mov p_n1(%rsp),%r9d + test $3,%r9d + jnz .L__near_one1 +.L__vlog2n: + + + # solve for ln(1+u) + movapd %xmm8,%xmm9 # u + mulpd %xmm8,%xmm8 # u^2 + movapd %xmm8,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm8,%xmm3 #Cu2 + mulpd %xmm9,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm8 # u^5 + movapd .L__real_log2e_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm9 # u+Au3 + movapd %xmm7,%xmm5 #z1 copy + mulpd %xmm3,%xmm8 # u5(B+Cu2) + movapd .L__real_log2e_tail(%rip),%xmm3 + movapd p_xexp2(%rsp),%xmm6 # xexp + addpd %xmm8,%xmm9 # poly + # recombine + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + addpd %xmm2,%xmm9 #z2 + movapd %xmm9,%xmm2 #z2 copy + + mulpd %xmm4,%xmm5 #z1*log2e_lead + mulpd %xmm4,%xmm9 #z2*log2e_lead + mulpd %xmm3,%xmm2 #z2*log2e_tail + mulpd %xmm3,%xmm7 #z1*log2e_tail + addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp + addpd %xmm2,%xmm7 #z1*log2e_tail + z2*log2e_tail + + + addpd %xmm9,%xmm7 #r2 + + # check for nans/infs + test $3,%r10d + addpd %xmm5,%xmm7 + jnz .L__log_naninf2 +.L__vlog3: +# check for negative numbers or zero + test $3,%r11d + jnz .L__z_or_n2 + +.L__vlog4: + + + mov p_n12(%rsp),%r9d + test $3,%r9d + jnz .L__near_one2 + +.L__vlog4n: + +# store the result _m128d + movapd %xmm7,%xmm1 + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + .align 16 +.Lboth_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# loge to log2 + movapd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subpd %xmm3,%xmm0 + addpd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movapd %xmm3,%xmm0 + movapd %xmm2,%xmm1 + + mulpd .L__real_log2e_tail(%rip),%xmm2 + mulpd .L__real_log2e_tail(%rip),%xmm0 + mulpd .L__real_log2e_lead(%rip),%xmm1 + mulpd .L__real_log2e_lead(%rip),%xmm3 + addpd %xmm2,%xmm0 + addpd %xmm1,%xmm0 + addpd %xmm3,%xmm0 + +# return r + r2; +# addpd %xmm2,%xmm0 + ret + + .align 16 +.L__near_one1: + cmp $3,%r9d + jnz .L__n1nb1 + + movapd p_x(%rsp),%xmm0 + call .Lboth_nearone + jmp .L__vlog2n + + .align 16 +.L__n1nb1: + test $1,%r9d + jz .L__lnn12 + + movlpd p_x(%rsp),%xmm0 + call .L__ln1 + +.L__lnn12: + test $2,%r9d # second number? 
+ jz .L__lnn1e + movlpd %xmm0,p_x(%rsp) + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x+8(%rsp) + movapd p_x(%rsp),%xmm0 + +.L__lnn1e: + jmp .L__vlog2n + + + .align 16 +.L__near_one2: + cmp $3,%r9d + jnz .L__n1nb2 + + movapd %xmm0,%xmm8 + movapd p_x2(%rsp),%xmm0 + call .Lboth_nearone + movapd %xmm0,%xmm7 + movapd %xmm8,%xmm0 + jmp .L__vlog4n + + .align 16 +.L__n1nb2: + movapd %xmm0,%xmm8 + test $1,%r9d + jz .L__lnn22 + + movapd %xmm7,%xmm0 + movlpd p_x2(%rsp),%xmm0 + call .L__ln1 + movapd %xmm0,%xmm7 + +.L__lnn22: + test $2,%r9d # second number? + jz .L__lnn2e + movlpd %xmm7,p_x2(%rsp) + movlpd p_x2+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,p_x2+8(%rsp) + movapd p_x2(%rsp),%xmm7 + +.L__lnn2e: + movapd %xmm8,%xmm0 + jmp .L__vlog4n + + .align 16 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + + +# loge to log2 + movsd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subsd %xmm3,%xmm0 + addsd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movsd %xmm3,%xmm0 + movsd %xmm2,%xmm1 + + mulsd .L__real_log2e_tail(%rip),%xmm2 + mulsd .L__real_log2e_tail(%rip),%xmm0 + mulsd .L__real_log2e_lead(%rip),%xmm1 + mulsd .L__real_log2e_lead(%rip),%xmm3 + addsd %xmm2,%xmm0 + addsd %xmm1,%xmm0 + addsd %xmm3,%xmm0 + +# return r + r2; +# addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe: + jmp .L__vlog1 # continue processing if not + +# at least one of the numbers was a nan or infinity +.L__log_naninf2: + movapd %xmm0,%xmm2 + test $1,%r10d # first number? + jz .L__lninf22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm7,%xmm1 # save the inputs + mov p_x2(%rsp),%rdx + movlpd p_x2(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm7,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + movapd %xmm0,%xmm7 + +.L__lninf22: + test $2,%r10d # second number? 
+ jz .L__lninfe2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rdx + movlpd p_x2+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe2: + movapd %xmm2,%xmm0 + jmp .L__vlog3 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn2: + test $2,%r9d # second number? + jz .L__zne + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne: + jmp .L__vlog2 + +.L__z_or_n2: + movapd %xmm0,%xmm2 + test $1,%r11d # first number? + jz .L__zn22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2(%rsp),%rax + call .L__zni + shufpd $2,%xmm7,%xmm0 + movapd %xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn22: + test $2,%r11d # second number? + jz .L__zne2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne2: + movapd %xmm2,%xmm0 + jmp .L__vlog4 +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x ## if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + + + .data + .align 16 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 # for alignment +.L__real_two: .quad 0x04000000000000000 # 1.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03F9EB85000000000 # .03 + .quad 0x03F9EB85000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 0x00000000000000001 # + .quad 
0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 +.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00 + .quad 0x03FF7154400000000 +.L__real_log2e_tail : .quad 0x03ECB295C17F0BBBE # log2e_tail 3.23791044778235969970E-06 + .quad 0x03ECB295C17F0BBBE +.L__mask_lower: .quad 0x0ffffffff00000000 + .quad 0x0ffffffff00000000 + .align 16 + +.L__np_ln_lead_table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01 + .quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01 + .quad 
0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01 + .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08 + .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08 
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08 + .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment +
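+# Summary of the special-case policy implemented by .L__lni/.L__zni
+# above, as a scalar C sketch (log2_special is a stand-in name; the
+# asm reaches these paths only for lanes already flagged as special):
+#
+#   #include <math.h>
+#   double log2_special(double x)
+#   {
+#       if (isnan(x))      return x + x;     /* quieted NaN (.L__lnan) */
+#       if (x == INFINITY) return INFINITY;  /* log2(+inf) = +inf      */
+#       if (x == 0.0)      return -INFINITY; /* log2(+-0) = -inf, C99  */
+#       return NAN;                          /* x < 0 (incl. -inf)     */
+#   }
+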
diff --git a/src/gas/vrd4sin.S b/src/gas/vrd4sin.S new file mode 100644 index 0000000..b611dfd --- /dev/null +++ b/src/gas/vrd4sin.S
@@ -0,0 +1,2915 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# vrd4sin.s
+#
+# A vector implementation of the sin libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_sin(__m128d x1, __m128d x2);
+#
+# Computes the sine of each of the four input values.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 double precision Sine values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
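+#
+# A rough scalar C model of one lane of the reduction and dispatch
+# below (illustrative only: TWOBYPI/PIBY2_* mirror the constants in
+# the data section, and sin_piby4/cos_piby4/huge_arg_path are
+# stand-ins for the polynomial kernels and the __amd_remainder_piby2
+# path; the real code works two lanes at a time and pairs the
+# per-lane cases through the .Levensin_oddcos_tbl jump table):
+#
+#   #include <math.h>
+#   #define TWOBYPI     0x1.45f306dc9c883p-1  /* 2/pi              */
+#   #define PIBY2_1     0x1.921fb544p+0       /* pi/2 head         */
+#   #define PIBY2_2     0x1.0b4611a6p-34      /* next bits         */
+#   #define PIBY2_2TAIL 0x1.3198a2e037073p-69 /* remaining bits    */
+#   extern double sin_piby4(double r, double rr);
+#   extern double cos_piby4(double r, double rr);
+#   extern double huge_arg_path(double x);
+#
+#   double sin_lane(double x)
+#   {
+#       double ax = fabs(x);
+#       if (ax >= 5e5) return huge_arg_path(x);
+#       int npi2     = (int)(ax * TWOBYPI + 0.5);
+#       double rhead = ax - npi2 * PIBY2_1;  /* exact for these npi2 */
+#       double rtail = npi2 * PIBY2_2;
+#       double t     = rhead;
+#       rhead = t - rtail;
+#       rtail = npi2 * PIBY2_2TAIL - ((t - rhead) - rtail);
+#       double r  = rhead - rtail;           /* reduced to ~[-pi/4,pi/4] */
+#       double rr = (rhead - r) - rtail;     /* low-order bits of r      */
+#       int region = npi2 & 3;               /* quadrant                 */
+#       double v = (region & 1) ? cos_piby4(r, rr) : sin_piby4(r, rr);
+#       int flip = ((region >> 1) & 1) ^ (x < 0.0); /* the ~AB+A~B bit   */
+#       return flip ? -v : v;
+#   }
+#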
+# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 16 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03fa5555555555555 + .quad 0x0bf56c16c16c16967 # -0.00138889 c2 + .quad 0x0bf56c16c16c16967 + .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3 + .quad 0x03efa01a019f4ec90 + .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4 + .quad 0x0be927e4fa17f65f6 + .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5 + .quad 0x03e21eeb69037ab78 + .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6 + .quad 0x0bda907db46cc5e42 +.Lsinarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bfc5555555555555 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03f81111111110bb3 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0bf2a01a019e83e5c + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03ec71de3796cde01 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0be5ae600b42fdfa7 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x03de5e0b2f9a43bb8 +.Lsincosarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x0bf56c16c16c16967 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x03efa01a019f4ec90 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x0be927e4fa17f65f6 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x03e21eeb69037ab78 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x0bda907db46cc5e42 +.Lcossinarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bf56c16c16c16967 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03efa01a019f4ec90 + 
.quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0be927e4fa17f65f6 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03e21eeb69037ab78 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0bda907db46cc5e42 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + +.Levensin_oddcos_tbl: + .quad .Lsinsin_sinsin_piby4 # 0 + .quad .Lsinsin_sincos_piby4 # 1 + .quad .Lsinsin_cossin_piby4 # 2 + .quad .Lsinsin_coscos_piby4 # 3 + + .quad .Lsincos_sinsin_piby4 # 4 + .quad .Lsincos_sincos_piby4 # 5 + .quad .Lsincos_cossin_piby4 # 6 + .quad .Lsincos_coscos_piby4 # 7 + + .quad .Lcossin_sinsin_piby4 # 8 + .quad .Lcossin_sincos_piby4 # 9 + .quad .Lcossin_cossin_piby4 # 10 + .quad .Lcossin_coscos_piby4 # 11 + + .quad .Lcoscos_sinsin_piby4 # 12 + .quad .Lcoscos_sincos_piby4 # 13 + .quad .Lcoscos_cossin_piby4 # 14 + .quad .Lcoscos_coscos_piby4 # 15 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .text + .align 16 + .p2align 4,,15 + +# define local variable storage offsets +.equ p_temp, 0x00 # temporary for get/put bits operation +.equ p_temp1, 0x10 # temporary for get/put bits operation + +.equ save_xmm6, 0x20 # temporary for get/put bits operation +.equ save_xmm7, 0x30 # temporary for get/put bits operation +.equ save_xmm8, 0x40 # temporary for get/put bits operation +.equ save_xmm9, 0x50 # temporary for get/put bits operation +.equ save_xmm10, 0x60 # temporary for get/put bits operation +.equ save_xmm11, 0x70 # temporary for get/put bits operation +.equ save_xmm12, 0x80 # temporary for get/put bits operation +.equ save_xmm13, 0x90 # temporary for get/put bits operation +.equ save_xmm14, 0x0A0 # temporary for get/put bits operation +.equ save_xmm15, 0x0B0 # temporary for get/put bits operation + +.equ r, 0x0C0 # pointer to r for remainder_piby2 +.equ rr, 0x0D0 # pointer to r for remainder_piby2 +.equ region, 0x0E0 # pointer to r for remainder_piby2 + +.equ r1, 0x0F0 # pointer to r for remainder_piby2 +.equ rr1, 0x0100 # pointer to r for remainder_piby2 +.equ region1, 0x0110 # pointer to r for remainder_piby2 + +.equ p_temp2, 0x0120 # temporary for get/put bits operation +.equ p_temp3, 0x0130 # temporary for get/put bits operation + +.equ p_temp4, 0x0140 # temporary for get/put bits operation +.equ p_temp5, 0x0150 # temporary for get/put bits operation + +.equ p_original, 0x0160 # original x +.equ p_mask, 0x0170 # original x +.equ p_sign, 0x0180 # original x + +.equ p_original1, 0x0190 # original x +.equ p_mask1, 0x01A0 # original x +.equ p_sign1, 0x01B0 # original x + +.equ save_r12, 0x01C0 # temporary for get/put bits operation +.equ save_r13, 0x01D0 # temporary for get/put bits operation + +.globl __vrd4_sin + .type __vrd4_sin,@function +__vrd4_sin: + + sub $0x1E8,%rsp + mov %r12,save_r12(%rsp) # save r12 + mov %r13,save_r13(%rsp) # save r13 + +#DEBUG +# jmp .Lfinal_check +#DEBUG + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + +movdqa %xmm0,%xmm6 +movdqa %xmm1,%xmm7 +movapd .L__real_7fffffffffffffff(%rip),%xmm2 + +andpd %xmm2,%xmm0 #Unsign +andpd %xmm2,%xmm1 #Unsign + +movd %xmm0,%rax #rax is lower arg +movhpd %xmm0, p_temp+8(%rsp) +mov p_temp+8(%rsp),%rcx #rcx = upper arg +movd %xmm1,%r8 #r8 is lower arg +movhpd %xmm1, p_temp1+8(%rsp) +mov p_temp1+8(%rsp),%r9 #r9 = upper arg + +movdqa %xmm0,%xmm12 +movdqa %xmm1,%xmm13 + +pcmpgtd %xmm6,%xmm12 +pcmpgtd %xmm7,%xmm13 +movdqa %xmm12,%xmm6 +movdqa %xmm13,%xmm7 +psrldq $4,%xmm12 +psrldq $4,%xmm13 +psrldq $8,%xmm6 +psrldq $8,%xmm7 + +mov 
$0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + + +por %xmm6,%xmm12 +por %xmm7,%xmm13 +movd %xmm12,%r12 #Move Sign to gpr ** +movd %xmm13,%r13 #Move Sign to gpr ** + +movapd %xmm0,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm0,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm0 + mulpd %xmm0,%xmm2 # * twobypi + mulpd %xmm0,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. + +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + mov %r8,%r10 + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the 
lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm0,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm0 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + subpd %xmm0,%xmm6 #rr=rhead-r + subpd %xmm1,%xmm7 #rr=rhead-r + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm0,%xmm2 # move r for r2 + movapd %xmm1,%xmm3 # move r for r2 + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail + subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail + + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + +#DEBUG +# jmp .Lfinal_check +#DEBUG + + leaq .Levensin_oddcos_tbl(%rip),%rsi + jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm10, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 
= t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm10 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm0 + subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm6 # rr=rhead-r + subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm0,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf + + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrd4_sin_lower_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) # rr = 0 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + call __amd_remainder_piby2@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf +# mov p_original(r%sp),%rax + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) #rr = 0 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp2(%rsp),%r8 + mov 
p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5:
+# mov p_original+8(%rsp),%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm10,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%eax # eax = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm6,rr(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+# mov p_original+8(%rsp),%rcx ; upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov
%rcx,r+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) # rr = 0 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcheck_next2_args: + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r8 + jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfirst_second_done_fourth_arg_gt_5e5 + + + +# Work on next two args, both < 5e5 +# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5 + + movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. + +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm5,region1(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm1,%xmm7 # rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + + subpd %xmm1,%xmm7 # rr=rhead-r + subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail + movapd %xmm7,rr1(%rsp) + + jmp .L__vrd4_sin_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm10, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
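+
+# A minimal C sketch of the Cody-Waite step implemented below (the decimal
+# constants are the standard fdlibm values whose bit patterns match the
+# .L__real_* literals loaded above; an illustrative sketch, not code from
+# this library):
+#
+#   static int reduce_piby2(double x, double *r, double *rr)
+#   {
+#       static const double twobypi     = 6.36619772367581382433e-01;
+#       static const double piby2_1     = 1.57079632673412561417e+00;
+#       static const double piby2_2     = 6.07710050630396597660e-11;
+#       static const double piby2_2tail = 2.02226624879595063154e-21;
+#
+#       int    npi2  = (int)(x * twobypi + 0.5);  /* nearest, for x >= 0 */
+#       double dnpi2 = (double)npi2;
+#       double rhead = x - dnpi2 * piby2_1;  /* exact: piby2_1 has trailing zero bits */
+#       double t     = rhead;
+#       double rtail = dnpi2 * piby2_2;
+#       rhead = t - rtail;
+#       rtail = dnpi2 * piby2_2tail - ((t - rhead) - rtail);
+#       *r  = rhead - rtail;                 /* reduced argument */
+#       *rr = (rhead - *r) - rtail;          /* low-order tail of r */
+#       return npi2 & 3;                     /* quadrant (region) */
+#   }
+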
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm4,region(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm0,%xmm6 # rhead + subpd %xmm8,%xmm0 # r = rhead - rtail + movapd %xmm0,r(%rsp) + + subpd %xmm0,%xmm6 # rr=rhead-r + subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail + movapd %xmm6,rr(%rsp) + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + movsd %xmm7,%xmm1 + subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail) + subsd %xmm1,%xmm7 # rr=rhead-r + subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail) + movlpd %xmm1,r1+8(%rsp) # store upper r + movlpd %xmm7,rr1+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is 
**NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_sin_lower_naninf_higher: +# mov p_original1(%rsp),%r8 ; upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) # rr = 0 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrd4_sin_reconstruct + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movsd %xmm1,%xmm0 + call __amd_remainder_piby2@PLT + mov p_temp1(%rsp),%r9 #Restore upper arg + jmp 0f + +.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf +# mov p_original1(%rsp),%r8 + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) #rr = 0 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr1+8(%rsp),%rsi + lea r1+8(%rsp),%rdi + movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher: +# mov p_original1+8(%rsp),%r9 ;upper arg is nan/inf +# movd %xmm6,%r9 ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) #rr = 0 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + jmp .L__vrd4_sin_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
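+
+# The cvttsd2si/cvtsi2sd (and packed cvttpd2dq/cvtdq2pd) round-trips used
+# throughout need npi2 twice: as an integer for the quadrant bookkeeping
+# and as a double for the three multiplies. A sketch of that step; the
+# truncation after adding 0.5 rounds to nearest only for nonnegative
+# input, which holds if the reduction runs on |x|, as the separate sign
+# bookkeeping in this file suggests:
+#
+#   static int nearest_npi2(double x, double twobypi, double *dnpi2)
+#   {
+#       int n = (int)(x * twobypi + 0.5);  /* cvttsd2si: truncate */
+#       *dnpi2 = (double)n;                /* cvtsi2sd: exact for int range */
+#       return n;
+#   }
+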
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+        movq    %xmm4,region(%rsp)      # Region
+
+# rhead = x - npi2 * piby2_1;
+        mulpd   %xmm2,%xmm0     # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+        mulpd   %xmm2,%xmm8     # rtail
+
+# rhead = x - npi2 * piby2_1;
+        subpd   %xmm0,%xmm6     # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+        movapd  %xmm6,%xmm0     # t
+
+# rhead = t - rtail;
+        subpd   %xmm8,%xmm0     # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+        mulpd   .L__real_3ba3198a2e037073(%rip),%xmm2   # npi2 * piby2_2tail
+
+        subpd   %xmm0,%xmm6     # t-rhead
+        subpd   %xmm6,%xmm8     # - ((t - rhead) - rtail)
+        addpd   %xmm2,%xmm8     # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+        movapd  %xmm0,%xmm6     # rhead
+        subpd   %xmm8,%xmm0     # r = rhead - rtail
+        movapd  %xmm0,r(%rsp)
+
+        subpd   %xmm0,%xmm6     # rr=rhead-r
+        subpd   %xmm8,%xmm6     # rr=(rhead-r) -rtail
+        movapd  %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+        movhpd  %xmm1,r1+8(%rsp)        #Save upper fp arg for remainder_piby2 call
+#       movlhps %xmm1,%xmm1     ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+#       movlhps %xmm3,%xmm3
+#       movlhps %xmm7,%xmm7
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+        movapd  .L__real_3fe0000000000000(%rip),%xmm4   # Restore 0.5
+        mulsd   .L__real_3fe45f306dc9c883(%rip),%xmm3   # x*twobypi
+        addsd   %xmm4,%xmm3     # xmm3 = npi2=(x*twobypi+0.5)
+        movsd   .L__real_3ff921fb54400000(%rip),%xmm2   # xmm2 = piby2_1
+        cvttsd2si       %xmm3,%r8d      # r8d = npi2 trunc to ints
+        movsd   .L__real_3dd0b4611a600000(%rip),%xmm0   # xmm0 = piby2_2
+        cvtsi2sd        %r8d,%xmm3      # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+        mulsd   %xmm3,%xmm2     # npi2 * piby2_1
+        subsd   %xmm2,%xmm7     # xmm7 = rhead =(x-npi2*piby2_1)
+        movsd   .L__real_3ba3198a2e037073(%rip),%xmm6   # xmm6 =piby2_2tail
+
+#t = rhead;
+        movsd   %xmm7,%xmm5     # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+        mulsd   %xmm3,%xmm0     # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+        subsd   %xmm0,%xmm7     # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+        mulsd   %xmm3,%xmm6     # npi2 * piby2_2tail
+        subsd   %xmm7,%xmm5     # t-rhead
+        subsd   %xmm5,%xmm0     # (rtail-(t-rhead))
+        addsd   %xmm6,%xmm0     # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+        mov     %r8d,region1(%rsp)      # store lower region
+        movsd   %xmm7,%xmm1
+        subsd   %xmm0,%xmm1     # xmm1 = r=(rhead-rtail)
+        subsd   %xmm1,%xmm7     # rr=rhead-r
+        subsd   %xmm0,%xmm7     # xmm7 = rr=((rhead-r) -rtail)
+
+        movlpd  %xmm1,r1(%rsp)  # store lower r
+        movlpd  %xmm7,rr1(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+        mov     $0x07ff0000000000000,%r11       # is upper arg nan/inf
+        mov     %r11,%r10
+        and     %r9,%r10
+        cmp     %r11,%r10
+        jz      .L__vrd4_sin_upper_naninf_higher
+
+        lea     region1+4(%rsp),%rdx    # upper arg is **NOT** nan/inf
+        lea     rr1+8(%rsp),%rsi
+        lea     r1+8(%rsp),%rdi
+        movlpd  r1+8(%rsp),%xmm0        #Restore upper fp arg for remainder_piby2 call
+        call    __amd_remainder_piby2@PLT
+        jmp     0f
+
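+
+# The .L*_naninf handlers (such as the one just below) build the result by
+# ORing 0x0008000000000000 into the argument's bit pattern: that sets the
+# quiet bit (bit 51), turning a signaling NaN or an infinity into a quiet
+# NaN, which is the required sin(NaN)/sin(Inf) result. In C terms:
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#
+#   static double quiet_input(double x)   /* sketch of the handler */
+#   {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);   /* the GPR already holds these bits */
+#       bits |= 0x0008000000000000ULL;    /* set mantissa MSB: snan/inf -> qnan */
+#       memcpy(&x, &bits, sizeof bits);
+#       return x;
+#   }
+#
+# On the non-exceptional path, the %rdi/%rsi/%rdx/%xmm0 setup before each
+# call suggests a prototype like
+#   void __amd_remainder_piby2(double x, double *r, double *rr, int *region);
+# (inferred from the register usage here, not taken from a header).
+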
+.L__vrd4_sin_upper_naninf_higher: +# mov p_original1+8(%rsp),%r9 ; upper arg is nan/inf +# mov r1+8(%rsp),%r9 ; upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) # rr = 0 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrd4_sin_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_sin_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd r(%rsp),%xmm0 + movapd r1(%rsp),%xmm1 + + movapd rr(%rsp),%xmm6 + movapd rr1(%rsp),%xmm7 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + + mov %r8,%r10 + mov %r9,%r11 + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm0,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + leaq .Levensin_oddcos_tbl(%rip),%rsi + jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_sin_cleanup: + + movapd p_sign(%rsp),%xmm0 + movapd p_sign1(%rsp),%xmm1 + xorpd %xmm4,%xmm0 # (+) Sign + xorpd %xmm5,%xmm1 # (+) Sign + +.Lfinal_check: + mov save_r12(%rsp),%r12 # restore r12 + mov save_r13(%rsp),%r13 # restore r13 + + add $0x1E8,%rsp + ret + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 
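+
+# What the reconstruct block above computes, per lane: bit 0 of the region
+# picks the kernel (even -> sin(r), odd -> cos(r)), and the NOT/AND/OR
+# sequence is an expanded XOR (~AB + A~B) of the input's sign bit with bit
+# 1 of the region. A C sketch, with libm sin/cos standing in for the
+# .L*_piby4 polynomial kernels dispatched through .Levensin_oddcos_tbl:
+#
+#   #include <math.h>
+#
+#   static double sin_from_region(double r, int region, int x_negative)
+#   {
+#       double v = (region & 1) ? cos(r) : sin(r);   /* quadrant kernel */
+#       int neg  = ((region >> 1) & 1) ^ x_negative; /* ~AB + A~B */
+#       return neg ? -v : v;                         /* applied via xorpd */
+#   }
+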
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_coscos_piby4: + + + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm2 # x4 recalculate + mulpd %xmm3,%xmm3 # x4 recalculate + + movapd p_temp2(%rsp),%xmm12 # r + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + + subpd %xmm12,%xmm4 # + t + subpd %xmm13,%xmm5 # + t + + jmp .L__vrd4_sin_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcossin_cossin_piby4: 
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # s3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm10,%xmm10 # move high x4 for cos term + movhlps %xmm11,%xmm11 # move high x4 for cos term + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd %xmm0,%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos + movhlps %xmm3,%xmm13 # move high r for cos + movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos + movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm7,%xmm5 # sin *x3 + mulsd %xmm10,%xmm8 # cos *x4 + mulsd %xmm11,%xmm9 # cos *x4 + + mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term + mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term + movsd %xmm12,%xmm6 # Keep high r for cos term + movsd %xmm13,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0 + + subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx + subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x*xx for cos term + movhlps %xmm1,%xmm11 # move high x for x*xx for cos term + + mulsd p_temp+8(%rsp),%xmm10 # x * xx + mulsd p_temp1+8(%rsp),%xmm11 # x * xx + + movsd %xmm12,%xmm2 # move -t for cos term + movsd %xmm13,%xmm3 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm12 #1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm13 #1+(-t) + addsd p_temp(%rsp),%xmm4 # sin+xx + addsd p_temp1(%rsp),%xmm5 # sin+xx + subsd %xmm6,%xmm12 # (1-t) - r + subsd %xmm7,%xmm13 # (1-t) - r + subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx + subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx + addsd %xmm0,%xmm4 # sin + x + addsd %xmm1,%xmm5 # sin + x + addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx) + addsd 
%xmm13,%xmm9 # cos+((1-t)-r - x*xx) + subsd %xmm2,%xmm8 # cos+t + subsd %xmm3,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_sin_cleanup + +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lsincos_cossin_piby4: # changed from sincos_sincos + # xmm1 is cossin and xmm0 is sincos + + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term + + movapd .Lsincosarray+0x50(%rip),%xmm4 # s6 + movapd .Lcossinarray+0x50(%rip),%xmm5 # s6 + movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3 + movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm10,%xmm10 # move high x4 for cos term + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin) + movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos) + + mulsd %xmm0,%xmm6 # get low x3 for sin term + mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos (cossin) + movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos) + + movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin) + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos) + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm11,%xmm5 # cos *x4 + mulsd %xmm10,%xmm8 # cos *x4 + mulsd %xmm7,%xmm9 # sin *x3 + + mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin) + mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos) + + movsd %xmm12,%xmm6 # Keep high r for cos term + movsd %xmm3,%xmm7 # Keep low r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin) + subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos) + + movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin) + movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos) + + mulsd p_temp+8(%rsp),%xmm10 # x * xx + mulsd p_temp1(%rsp),%xmm1 # x * xx + + movsd %xmm12,%xmm2 # move -t for cos term + movsd %xmm3,%xmm13 # move -t for cos term + + addsd 
.L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t) + + addsd p_temp(%rsp),%xmm4 # sin+xx + + addsd p_temp1+8(%rsp),%xmm9 # sin+xx + + + subsd %xmm6,%xmm12 # (1-t) - r + subsd %xmm7,%xmm3 # (1-t) - r + + subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx + subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx + + addsd %xmm0,%xmm4 # sin + x + + addsd %xmm11,%xmm9 # sin + x + + + addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx) + addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx) + + subsd %xmm2,%xmm8 # cos+t + subsd %xmm13,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lsincos_sincos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm0,p_temp2(%rsp) # Store r + movapd %xmm1,p_temp3(%rsp) # Store r + + + movapd .Lcossinarray+0x50(%rip),%xmm4 # s6 + movapd .Lcossinarray+0x50(%rip),%xmm5 # s6 + movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3 + movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term + movhlps %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term + mulsd p_temp3+8(%rsp),%xmm7 # get low x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term + movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term + # Reverse 12 and 2 + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos + + mulsd %xmm6,%xmm8 # sin *x3 + mulsd %xmm7,%xmm9 # sin *x3 + mulsd %xmm10,%xmm4 # cos *x4 + mulsd %xmm11,%xmm5 # cos *x4 + + mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term + mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term + movsd %xmm2,%xmm6 # Keep high r for cos term + movsd %xmm3,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0 + + subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx + subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx + + 
movhlps %xmm0,%xmm10 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x for sin term + # Reverse 10 and 0 + + mulsd p_temp(%rsp),%xmm0 # x * xx + mulsd p_temp1(%rsp),%xmm1 # x * xx + + movsd %xmm2,%xmm12 # move -t for cos term + movsd %xmm3,%xmm13 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t) + addsd p_temp+8(%rsp),%xmm8 # sin+xx + addsd p_temp1+8(%rsp),%xmm9 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm7,%xmm3 # (1-t) - r + + subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx + subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm8 # sin + x + addsd %xmm11,%xmm9 # sin + x + + addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx) + addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx) + + subsd %xmm12,%xmm4 # cos+t + subsd %xmm13,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lcossin_sincos_piby4: # changed from sincos_sincos + # xmm1 is cossin and xmm0 is sincos +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm0,p_temp2(%rsp) # Store r + + + movapd .Lcossinarray+0x50(%rip),%xmm4 # s6 + movapd .Lsincosarray+0x50(%rip),%xmm5 # s6 + movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3 + movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm11,%xmm11 # move high x4 for cos term + + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + + mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term + movhlps %xmm3,%xmm13 # move high r for cos + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin + + mulsd %xmm6,%xmm8 # sin *x3 + mulsd %xmm11,%xmm9 # cos *x4 + mulsd %xmm10,%xmm4 # cos *x4 + mulsd %xmm7,%xmm5 # sin *x3 + + mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term + mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx 
for sin term + + movsd %xmm2,%xmm6 # Keep high r for cos term + movsd %xmm13,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0 + + subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx + subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x*xx for cos term + + mulsd p_temp(%rsp),%xmm0 # x * xx + mulsd p_temp1+8(%rsp),%xmm11 # x * xx + + movsd %xmm2,%xmm12 # move -t for cos term + movsd %xmm13,%xmm3 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t) + + addsd p_temp+8(%rsp),%xmm8 # sin+xx + addsd p_temp1(%rsp),%xmm5 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm7,%xmm13 # (1-t) - r + + subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx + subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx + + + addsd %xmm10,%xmm8 # sin + x + addsd %xmm1,%xmm5 # sin + x + + addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx) + addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx) + + subsd %xmm12,%xmm4 # cos+t + subsd %xmm3,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_sin_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_sinsin_piby4: + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,p_temp2(%rsp) # store x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + movapd %xmm11,p_temp3(%rsp) # store r + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + mulpd %xmm2,%xmm10 # x4 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm0,%xmm2 # x3 recalculate + mulpd %xmm3,%xmm3 # x4 recalculate + + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm6,%xmm12 # 0.5 * x2 *xx + mulpd %xmm1,%xmm7 # x * xx + + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm12,%xmm4 # -0.5 * x2 *xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm6,%xmm4 # x3 * zs +xx + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + addpd %xmm0,%xmm4 # +x + subpd %xmm13,%xmm5 # + t + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lsinsin_coscos_piby4: + + movapd 
%xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + movapd %xmm3,p_temp3(%rsp) # store x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + movapd %xmm10,p_temp2(%rsp) # store r + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + mulpd %xmm3,%xmm11 # x4 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm2 # x4 recalculate + mulpd %xmm1,%xmm3 # x3 recalculate + + movapd p_temp2(%rsp),%xmm12 # r + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm7,%xmm13 # 0.5 * x2 *xx + subpd %xmm12,%xmm10 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x3 * zs + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;; + subpd %xmm13,%xmm5 # -0.5 * x2 *xx + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm7,%xmm5 # +xx + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + addpd %xmm1,%xmm5 # +x + subpd %xmm12,%xmm4 # + t + + jmp .L__vrd4_sin_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_cossin_piby4: #Derive from cossin_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + movhlps %xmm10,%xmm10 # get upper r for t for cos + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # 
x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + movsd %xmm0,%xmm8 # lower x for sin + mulsd %xmm2,%xmm8 # lower x3 for sin + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # upper x4 for cos + movsd %xmm8,%xmm2 # lower x3 for sin + + movsd %xmm6,%xmm9 # lower xx + # note using odd reg + + movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm0,%xmm6 # x * xx for upper cos term + mulpd %xmm1,%xmm7 # x * xx + movhlps %xmm6,%xmm6 + mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + # x3 * zs + + movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin + + subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp(%rsp),%xmm4 # +xx + + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subsd %xmm12,%xmm8 # + t + addsd %xmm0,%xmm4 # +x + subpd %xmm13,%xmm5 # + t + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lcoscos_sincos_piby4: #Derive from sincos_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lcossinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zszc + addpd %xmm9,%xmm5 # z + + mulpd %xmm0,%xmm2 # upper x3 for sin + mulsd 
%xmm0,%xmm2 # lower x4 for cos + mulpd %xmm3,%xmm3 # x4 + + movhlps %xmm6,%xmm9 # upper xx for sin term + # note using odd reg + + movlpd p_temp2(%rsp),%xmm12 # lower r for cos term + movapd p_temp3(%rsp),%xmm13 # r + + + mulpd %xmm0,%xmm6 # x * xx for lower cos term + mulpd %xmm1,%xmm7 # x * xx + + mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + mulpd %xmm3,%xmm5 + # x4 * zc + + movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos + subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx + + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp+8(%rsp),%xmm8 # +xx + + movhlps %xmm0,%xmm0 # upper x for sin + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subsd %xmm12,%xmm4 # + t + subpd %xmm13,%xmm5 # + t + addsd %xmm0,%xmm8 # +x + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lcossin_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + movhlps %xmm11,%xmm11 # get upper r for t for cos + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zcs + + movsd %xmm1,%xmm9 # lower x for sin + mulsd %xmm3,%xmm9 # lower x3 for sin + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # upper x4 for cos + movsd %xmm9,%xmm3 # lower x3 for sin + + movsd %xmm7,%xmm8 # lower xx + # note using even reg + + movapd p_temp2(%rsp),%xmm12 # r + movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx for upper cos term + movhlps %xmm7,%xmm7 + mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 
# x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + # x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin + + subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1(%rsp),%xmm5 # +xx + + + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + + subpd %xmm12,%xmm4 # + t + subsd %xmm13,%xmm9 # + t + addsd %xmm1,%xmm5 # +x + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lcossin_sinsin_piby4: # Derived from sincos_sinsin + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + movhlps %xmm11,%xmm11 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm6,%xmm10 # 0.5*x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zczs + + movsd %xmm3,%xmm12 + mulsd %xmm1,%xmm12 # low x3 for sin + + mulpd %xmm0, %xmm2 # x3 + mulpd %xmm3, %xmm3 # high x4 for cos + movsd %xmm12,%xmm3 # low x3 for sin + + movhlps %xmm1,%xmm8 # upper x for cos term + # note using even reg + movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term + + mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term + + mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term + + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin + + subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx + + subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx + + subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx + addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1(%rsp),%xmm5 # +xx + + addpd %xmm6,%xmm4 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + + addsd %xmm1,%xmm5 # +x + addpd %xmm0,%xmm4 # +x + subsd %xmm13,%xmm9 # + t + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lsincos_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa 
.Lcossinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcossinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm7,%xmm8 # upper xx for sin term + # note using even reg + + movapd p_temp2(%rsp),%xmm12 # r + movlpd p_temp3(%rsp),%xmm13 # lower r for cos term + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx for lower cos term + + mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos + + subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1+8(%rsp),%xmm9 # +xx + + movhlps %xmm1,%xmm1 # upper x for sin + subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subpd %xmm12,%xmm4 # + t + subsd %xmm13,%xmm5 # + t + addsd %xmm1, %xmm9 # +x + + movlhps %xmm9, %xmm5 + + jmp .L__vrd4_sin_cleanup + + +.align 16 +.Lsincos_sinsin_piby4: # Derived from sincos_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcossinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # 
copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm6,%xmm10 # 0.5x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm0,%xmm2 # x3 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm7,%xmm8 # upper xx for sin term + # note using even reg + + movlpd p_temp3(%rsp),%xmm13 # lower r for cos term + + mulpd %xmm1,%xmm7 # x * xx for lower cos term + + mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term + + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos + + subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx + + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx + addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1+8(%rsp),%xmm9 # +xx + + movhlps %xmm1,%xmm1 # upper x for sin + addpd %xmm6,%xmm4 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + addsd %xmm1,%xmm9 # +x + addpd %xmm0,%xmm4 # +x + subsd %xmm13,%xmm5 # + t + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_sin_cleanup + + +.align 16 +.Lsinsin_cossin_piby4: # Derived from sincos_sinsin + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # x2 + movapd %xmm6,p_temp(%rsp) # xx + + movhlps %xmm10,%xmm10 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm7,%xmm11 # 0.5*x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + 
x2(c5+x2c6))
+
+	addpd	%xmm8,%xmm4				# zczs
+	addpd	%xmm9,%xmm5				# zs
+
+
+	movsd	%xmm2,%xmm13
+	mulsd	%xmm0,%xmm13				# low x3 for sin
+
+	mulpd	%xmm1,%xmm3				# x3
+	mulpd	%xmm2,%xmm2				# high x4 for cos
+	movsd	%xmm13,%xmm2				# low x3 for sin
+
+
+	movhlps	%xmm0,%xmm9				# upper x for cos term ; note using even reg
+	movlpd	p_temp2+8(%rsp),%xmm12			# upper r for cos term
+	mulsd	p_temp+8(%rsp),%xmm9			# x * xx for upper cos term
+	mulsd	p_temp2(%rsp),%xmm6			# xx * 0.5*x2 for lower sin term
+	subsd	%xmm12,%xmm10				# (1 + (-t)) - r
+	mulpd	%xmm3,%xmm5				# x3 * zs
+	mulpd	%xmm2,%xmm4				# lower=x3 * zs
+							# upper=x4 * zc
+
+	movhlps	%xmm4,%xmm8				# xmm8= cos, xmm4= sin
+	subsd	%xmm6,%xmm4				# x3zs - 0.5*x2*xx
+
+	subsd	%xmm9,%xmm10				# ((1 + (-t)) - r) - x*xx
+
+	subpd	%xmm11,%xmm5				# x3*zs - 0.5*x2*xx
+
+	addsd	%xmm10,%xmm8				# x4*zc + (((1 + (-t)) - r) - x*xx)
+	addsd	p_temp(%rsp),%xmm4			# +xx
+
+	addpd	%xmm7,%xmm5				# +xx
+	subsd	.L__real_3ff0000000000000(%rip),%xmm12	# -t = r-1
+
+	addsd	%xmm0,%xmm4				# +x
+	addpd	%xmm1,%xmm5				# +x
+	subsd	%xmm12,%xmm8				# + t
+	movlhps	%xmm8,%xmm4
+
+	jmp 	.L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4:					# Derived from sincos_coscos
+
+	movapd	%xmm2,%xmm10				# x2
+	movapd	%xmm3,%xmm11				# x2
+
+	movdqa	.Lcossinarray+0x50(%rip),%xmm4		# c6
+	movdqa	.Lsinarray+0x50(%rip),%xmm5		# c6
+	movapd	.Lcossinarray+0x20(%rip),%xmm8		# c3
+	movapd	.Lsinarray+0x20(%rip),%xmm9		# c3
+
+	mulpd	.L__real_3fe0000000000000(%rip),%xmm10	# r = 0.5 *x2
+	mulpd	.L__real_3fe0000000000000(%rip),%xmm11	# r = 0.5 *x2
+
+	mulpd	%xmm2,%xmm4				# c6*x2
+	mulpd	%xmm3,%xmm5				# c6*x2
+
+	movapd	%xmm10,p_temp2(%rsp)			# r
+	movapd	%xmm6,p_temp(%rsp)			# rr
+
+	mulpd	%xmm2,%xmm8				# c3*x2
+	mulpd	%xmm3,%xmm9				# c3*x2
+
+	subsd	.L__real_3ff0000000000000(%rip),%xmm10	# -t=r-1.0 for cos
+
+	movapd	%xmm2,%xmm12				# copy of x2 for x4
+	movapd	%xmm3,%xmm13				# copy of x2 for x4
+
+	addpd	.Lcossinarray+0x40(%rip),%xmm4		# c5+x2c6
+	addpd	.Lsinarray+0x40(%rip),%xmm5		# c5+x2c6
+	addpd	.Lcossinarray+0x10(%rip),%xmm8		# c2+x2C3
+	addpd	.Lsinarray+0x10(%rip),%xmm9		# c2+x2C3
+
+	mulpd	%xmm7,%xmm11				# 0.5x2*xx
+	addsd	.L__real_3ff0000000000000(%rip),%xmm10	# 1 + (-t) for cos
+
+	mulpd	%xmm2,%xmm12				# x4
+	mulpd	%xmm3,%xmm13				# x4
+
+	mulpd	%xmm2,%xmm4				# x2(c5+x2c6)
+	mulpd	%xmm3,%xmm5				# x2(c5+x2c6)
+	mulpd	%xmm2,%xmm8				# x2(c2+x2C3)
+	mulpd	%xmm3,%xmm9				# x2(c2+x2C3)
+
+	mulpd	%xmm2,%xmm12				# x6
+	mulpd	%xmm3,%xmm13				# x6
+
+	addpd	.Lcossinarray+0x30(%rip),%xmm4		# c4 + x2(c5+x2c6)
+	addpd	.Lsinarray+0x30(%rip),%xmm5		# c4 + x2(c5+x2c6)
+	addpd	.Lcossinarray(%rip),%xmm8		# c1 + x2(c2+x2C3)
+	addpd	.Lsinarray(%rip),%xmm9			# c1 + x2(c2+x2C3)
+
+	mulpd	%xmm12,%xmm4				# x6(c4 + x2(c5+x2c6))
+	mulpd	%xmm13,%xmm5				# x6(c4 + x2(c5+x2c6))
+
+	addpd	%xmm8,%xmm4				# zs
+	addpd	%xmm9,%xmm5				# zszc
+
+	mulpd	%xmm1,%xmm3				# x3
+	mulpd	%xmm0,%xmm2				# upper x3 for sin
+	mulsd	%xmm0,%xmm2				# lower x4 for cos
+
+	movhlps	%xmm6,%xmm9				# upper xx for sin term
+							# note using even reg
+
+	movlpd	p_temp2(%rsp),%xmm12			# lower r for cos term
+
+	mulpd	%xmm0,%xmm6				# x * xx for lower cos term
+
+	mulsd	p_temp2+8(%rsp),%xmm9			# xx * 0.5*x2 for upper sin term
+
+	subsd	%xmm12,%xmm10				# (1 + (-t)) - r
+
+	mulpd	%xmm3,%xmm5				# x3 * zs
+	mulpd	%xmm2,%xmm4				# lower=x4 * zc
+							# upper=x3 * zs
+
+	movhlps	%xmm4,%xmm8				# xmm8= sin, xmm4= cos
+
+	subsd	%xmm9,%xmm8				# x3zs - 0.5*x2*xx
+
+	subsd	%xmm6,%xmm10				# ((1 + (-t)) - r) - x*xx
+
+	subpd	%xmm11,%xmm5				# x3*zs - 0.5*x2*xx
+	addsd	%xmm10,%xmm4				# x4*zc + (((1 + (-t)) - r) - x*xx)
+	addsd	p_temp+8(%rsp),%xmm8			# +xx
+
+	movhlps	%xmm0,%xmm0				# upper x for sin
+	addpd	%xmm7,%xmm5
# +xx + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + + + addsd %xmm0,%xmm8 # +x + addpd %xmm1,%xmm5 # +x + subsd %xmm12,%xmm4 # + t + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_sin_cleanup + + +.align 16 +.Lsinsin_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +#DEBUG +# xorpd %xmm0, %xmm0 +# xorpd %xmm1, %xmm1 +# jmp .Lfinal_check +#DEBUG + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,p_temp2(%rsp) # copy of x2 + movapd %xmm3,p_temp3(%rsp) # copy of x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm6,%xmm2 # 0.5 * x2 *xx + mulpd %xmm7,%xmm3 # 0.5 * x2 *xx + + mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + movapd p_temp2(%rsp),%xmm10 # x2 + movapd p_temp3(%rsp),%xmm11 # x2 + + mulpd %xmm0,%xmm10 # x3 + mulpd %xmm1,%xmm11 # x3 + + mulpd %xmm10,%xmm4 # x3 * zs + mulpd %xmm11,%xmm5 # x3 * zs + + subpd %xmm2,%xmm4 # -0.5 * x2 *xx + subpd %xmm3,%xmm5 # -0.5 * x2 *xx + + addpd %xmm6,%xmm4 # +xx + addpd %xmm7,%xmm5 # +xx + + addpd %xmm0,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + + jmp .L__vrd4_sin_cleanup
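+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# Note: the four *_piby4 kernels above evaluate the same pair of core
+# polynomials over the reduced argument |x| <= pi/4 and differ only in
+# which packed lane takes the sin path and which takes the cos path.
+# As a reading aid, a minimal scalar C sketch of one sin/cos lane pair
+# follows. S1..S6 and C1..C6 stand for the .Lsinarray/.Lcosarray minimax
+# coefficients, x is the reduced argument and xx its tail word; these are
+# descriptive names, not identifiers defined in this file.
+#
+#   double x2 = x*x, x3 = x2*x, x4 = x2*x2, x6 = x4*x2;
+#   double zs = (S1 + x2*(S2 + x2*S3)) + x6*(S4 + x2*(S5 + x2*S6));
+#   double zc = (C1 + x2*(C2 + x2*C3)) + x6*(C4 + x2*(C5 + x2*C6));
+#   double sinx = x + ((x3*zs - 0.5*x2*xx) + xx);
+#   double r = 0.5*x2, t = 1.0 - r;          /* head of cos(x) */
+#   double cosx = t + (x4*zc + (((1.0 - t) - r) - x*xx));
+#
+# The ((1.0 - t) - r) term recovers the rounding error of t, which is what
+# the "-t = r-1" / "1 + (-t)" comments in the kernels are tracking.
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;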
diff --git a/src/gas/vrda_scaled_logr.S b/src/gas/vrda_scaled_logr.S new file mode 100644 index 0000000..9d1bdc1 --- /dev/null +++ b/src/gas/vrda_scaled_logr.S
@@ -0,0 +1,2428 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv.  If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrda_scaled_logr.s
+#
+# An array implementation of the log libm function.
+# Adapted to provide a scaling and shifting factor. This routine is
+# used by the ACML RNG distribution functions.
+#
+# Prototype:
+#
+#    void vrda_scaled_logr(int n, double *x, double *y, double b);
+#
+# Computes the natural log of x, multiplied by b.
+# A reduced precision routine. Uses the Intel novel reduction technique
+# with frcpa to compute logs.
+# Also uses only 3 polynomial terms to achieve 52-18 = 34 significant bits.
+#
+# This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+# This routine is not C99 compliant.
+# This version can compute logs in 26 cycles with n <= 24.
+#
+#
+
+
+# define local variable storage offsets
+.equ	p_x,0			# temporary for error checking operation
+.equ	p_idx,0x010		# index storage
+.equ	p_xexp,0x020		# index storage
+
+.equ	p_x2,0x030		# temporary for error checking operation
+.equ	p_idx2,0x040		# index storage
+.equ	p_xexp2,0x050		# index storage
+
+.equ	save_xa,0x060		#qword
+.equ	save_ya,0x068		#qword
+.equ	save_nv,0x070		#qword
+.equ	p_iter,0x078		# qword storage for number of loop iterations
+
+
+
+.equ	p2_temp,0x090		# second temporary for get/put bits operation
+
+.equ	stack_size,0x0e8
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+	.weak	vrda_scaled_logr__
+	.set	vrda_scaled_logr__,__vrda_scaled_logr__
+	.weak	vrda_scaled_logr_
+	.set	vrda_scaled_logr_,__vrda_scaled_logr__
+
+# Fortran interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# rcx - double *b
+
+	.text
+	.align 16
+	.p2align 4,,15
+
+#x/* a FORTRAN subroutine implementation of array log
+#**     VRDA_SCALED_LOG(N,X,Y,B)
+# C equivalent*/
+#void vrda_scaled_logr__(int * n, double *x, double *y,double *b)
+#{
+#	vrda_scaled_logr(*n,x,y,b);
+#}
+.globl __vrda_scaled_logr__
+	.type	__vrda_scaled_logr__,@function
+__vrda_scaled_logr__:
+	mov	(%rdi),%edi
+	movlpd	(%rcx),%xmm0
+
+# C interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# xmm0 - double b
+
+	.align 16
+	.p2align 4,,15
+.globl vrda_scaled_logr
+	.type	vrda_scaled_logr,@function
+vrda_scaled_logr:
+	sub	$stack_size,%rsp
+
+# save the arguments
+	mov	%rsi,save_xa(%rsp)	# save x_array pointer
+	mov	%rdx,save_ya(%rsp)	# save y_array pointer
+#ifdef INTEGER64
+	mov	%rdi,%rax
+#else
+	mov	%edi,%eax
+	mov	%rax,%rdi
+#endif
+
+# move the scale and shift factor to another register
+	movsd	%xmm0,%xmm10
+	unpcklpd %xmm10,%xmm10
+
+	mov	%rdi,save_nv(%rsp)	# save number of values
+# see if too few values to call the main loop
+	shr	$2,%rax			# get number of iterations
+	jz	.L__vda_cleanup		# jump if only single calls
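+
+# As a reading aid, a minimal scalar C sketch of what one iteration of the
+# loop below computes for each element. recip_approx(), N, k, ln2 and
+# lnf[] are descriptive names, not identifiers from this file:
+# recip_approx() stands for the frcpa-style reciprocal estimate returned
+# by __vrd4_frcpa, and lnf[k] for the .L__np_lnf_table entry
+# ln(1/frcpa(y)) with y = 1 + k/256.
+#
+#   double c = recip_approx(x);          /* ~1/x, limited precision   */
+#   double r = x*c - 1.0;                /* small reduced argument    */
+#   double p = r + r*r*(-0.5 + r/3.0);   /* ln(1+r), 3 terms          */
+#   /* N and k are read from the exponent and mantissa bits of c */
+#   y[i] = b * (N*ln2 + lnf[k] + p);     /* b is the scale factor     */
+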
+# prepare the iteration counts
+	mov	%rax,p_iter(%rsp)	# save number of iterations
+	shl	$2,%rax
+	sub	%rax,%rdi		# compute number of extra single calls
+	mov	%rdi,save_nv(%rsp)	# save number of left over values
+
+# In this second version, process the array 4 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+	mov	save_xa(%rsp),%rsi	# get x_array pointer
+	movlpd	(%rsi),%xmm0
+	movhpd	8(%rsi),%xmm0
+	prefetch	64(%rsi)
+	add	$32,%rsi
+	mov	%rsi,save_xa(%rsp)	# save x_array pointer
+
+	movlpd	-16(%rsi),%xmm1
+	movhpd	-8(%rsi),%xmm1
+
+# compute the logs
+
+#	movdqa	%xmm0,p_x(%rsp)	# save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
+	movdqa	%xmm0,%xmm8
+	movdqa	%xmm1,%xmm9
+
+	call	__vrd4_frcpa@PLT
+	movdqa	%xmm8,%xmm4
+	movdqa	%xmm9,%xmm7
+# invert the exponent
+	psllq	$1,%xmm8
+	psllq	$1,%xmm9
+	mulpd	%xmm0,%xmm4		# r
+	mulpd	%xmm1,%xmm7		# r
+	movdqa	%xmm8,%xmm5
+	paddq	.L__mask_rup(%rip),%xmm8
+	psrlq	$53,%xmm8
+	movdqa	%xmm9,%xmm6
+	paddq	.L__mask_rup(%rip),%xmm6
+	psrlq	$53,%xmm6
+	psubq	.L__mask_3ff(%rip),%xmm8
+	psubq	.L__mask_3ff(%rip),%xmm6
+	pshufd	$0x058,%xmm8,%xmm8
+	pshufd	$0x058,%xmm6,%xmm6
+
+
+	subpd	.L__real_one(%rip),%xmm4
+	subpd	.L__real_one(%rip),%xmm7
+
+	cvtdq2pd	%xmm8,%xmm0	#N
+	cvtdq2pd	%xmm6,%xmm1	#N
+#	movdqa	%xmm8,%xmm0
+#	movdqa	%xmm6,%xmm1
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
+	psrlq	$42,%xmm5
+	psrlq	$42,%xmm9
+	paddq	.L__int_one(%rip),%xmm5
+	paddq	.L__int_one(%rip),%xmm9
+	psrlq	$1,%xmm5
+	psrlq	$1,%xmm9
+	pand	.L__mask_3ff(%rip),%xmm5
+	pand	.L__mask_3ff(%rip),%xmm9
+	psllq	$1,%xmm5
+	psllq	$1,%xmm9
+
+	movdqa	%xmm5,p_x(%rsp)		# move the indexes to a memory location
+	movdqa	%xmm9,p_x2(%rsp)
+
+
+	movapd	.L__real_third(%rip),%xmm3
+	movdqa	%xmm3,%xmm5
+	movapd	%xmm4,%xmm2
+	movapd	%xmm7,%xmm8
+
+# approximation
+# compute the polynomial (reduced precision, 3 terms total)
+# p(r) = p2*r^2 + p3*r^3, with p2 = -1/2 and p3 = 1/3; r itself is added below
+
+	mulpd	%xmm4,%xmm2		#r^2
+	mulpd	%xmm7,%xmm8		#r^2
+
+	mulpd	%xmm4,%xmm3		# 1/3r
+	mulpd	%xmm7,%xmm5		# 1/3r
+# lookup the f(k) term
+	lea	.L__np_lnf_table(%rip),%rdx
+	mov	p_x(%rsp),%rcx
+	mov	p_x+8(%rsp),%r9
+	movlpd	(%rdx,%rcx,8),%xmm6	# lookup
+	movhpd	(%rdx,%r9,8),%xmm6	# lookup
+
+	addpd	.L__real_half(%rip),%xmm3	# p2 + p3r
+	addpd	.L__real_half(%rip),%xmm5	# p2 + p3r
+
+	mov	p_x2(%rsp),%rcx
+	mov	p_x2+8(%rsp),%r9
+	movlpd	(%rdx,%rcx,8),%xmm9	# lookup
+	movhpd	(%rdx,%r9,8),%xmm9	# lookup
+
+	mulpd	%xmm3,%xmm2		# r2(p2 + p3r)
+	mulpd	%xmm5,%xmm8		# r2(p2 + p3r)
+	addpd	%xmm4,%xmm2		# +r
+	addpd	%xmm7,%xmm8		# +r
+
+
+# reconstruction
+# compute ln(x) = T + r + p(r) where
+# T = N*ln(2)+ln(1/frcpa(x)) via tab of ln(1/frcpa(y)), where y = 1 + k/256, 0<=k<=255
+
+	mulpd	.L__real_log2(%rip),%xmm0	# compute N*__real_log2
+	mulpd	.L__real_log2(%rip),%xmm1	# compute N*__real_log2
+	addpd	%xmm6,%xmm2		# add the new mantissas
+	addpd	%xmm9,%xmm8		# add the new mantissas
+	addpd	%xmm2,%xmm0
+	addpd	%xmm8,%xmm1
+
+
+# store the result _m128d
+	mov	save_ya(%rsp),%rdi	# get y_array pointer
+	mulpd	%xmm10,%xmm0
+	movlpd	%xmm0,(%rdi)
+	movhpd	%xmm0,8(%rdi)
+
+
+	prefetch	64(%rdi)
+	add	$32,%rdi
+	mov	%rdi,save_ya(%rsp)	# save y_array pointer
+
+# store the result _m128d
+	mulpd	%xmm10,%xmm1
+	movlpd	%xmm1,-16(%rdi)
+	movhpd	%xmm1,-8(%rdi)
+
+	mov	p_iter(%rsp),%rax	# get number of iterations
+	sub	$1,%rax
+	mov	%rax,p_iter(%rsp)	# save number of iterations
+	jnz	.L__vda_top
+
+
+# see if we need to do any extras
+	mov	save_nv(%rsp),%rax	# get number of values
+	test	%rax,%rax
+
jnz .L__vda_cleanup + + +.L__finish: + add $stack_size,%rsp + ret + + .align 16 + + + +# we jump here when we have an odd number of log calls to make at the +# end +# we assume that rdx is pointing at the next x array element, +# r8 at the next y array element. The number of values left is in +# save_nv +.L__vda_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__finish # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + +# fill in a m128d with zeroes and the extra values and then make a recursive call. + xorpd %xmm0,%xmm0 + movlpd %xmm0,p_x+8(%rsp) + movapd %xmm0,p_x+16(%rsp) + + mov (%rsi),%rcx # we know there's at least one + mov %rcx,p_x(%rsp) + cmp $2,%rax + jl .L__vdacg + + mov 8(%rsi),%rcx # do the second value + mov %rcx,p_x+8(%rsp) + cmp $3,%rax + jl .L__vdacg + + mov 16(%rsi),%rcx # do the third value + mov %rcx,p_x+16(%rsp) + +.L__vdacg: + mov $4,%rdi # parameter for N + lea p_x(%rsp),%rsi # &x parameter + lea p2_temp(%rsp),%rdx # &y parameter + movsd %xmm10,%xmm0 + call vrda_scaled_logr@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + mov p2_temp(%rsp),%rcx + mov %rcx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vdacgf + + mov p2_temp+8(%rsp),%rcx + mov %rcx,8(%rdi) # do the second value + cmp $3,%rax + jl .L__vdacgf + + mov p2_temp+16(%rsp),%rcx + mov %rcx,16(%rdi) # do the third value + +.L__vdacgf: + jmp .L__finish + + .data + .align 64 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 + +.L__real_half: .quad 0x0bfe0000000000000 # 1/2 + .quad 0x0bfe0000000000000 +.L__real_third: .quad 0x03fd5555555555555 # 1/3 + .quad 0x03fd5555555555555 +.L__real_fourth: .quad 0x0bfd0000000000000 # 1/4 + .quad 0x0bfd0000000000000 + +.L__real_log2: .quad 0x03FE62E42FEFA39EF # 0.693147182465 + .quad 0x03FE62E42FEFA39EF + +.L__mask_3ff: .quad 0x000000000000003ff # + .quad 0x000000000000003ff + +.L__mask_rup: .quad 0x0000003fffffffffe + .quad 0x0000003fffffffffe + +.L__int_one: .quad 0x00000000000000001 + .quad 0x00000000000000001 + + + + +.L__np_lnf_table: +#log table Program - logtab.c +#Built Jan 18 2006 09:51:57 +#Compiler version 1400 + + .quad 0x00000000000000000 # 0.000000000000 0 + .quad 0x00000000000000000 + .quad 0x03F50020055655885 # 0.000977039648 1 + .quad 0x03F50020055655885 + .quad 0x03F60040155D5881E # 0.001955034836 2 + .quad 0x03F60040155D5881E + .quad 0x03F6809048289860A # 0.002933987435 3 + .quad 0x03F6809048289860A + .quad 0x03F70080559588B25 # 0.003913899321 4 + .quad 0x03F70080559588B25 + .quad 0x03F740C8A7478788D # 0.004894772377 5 + .quad 0x03F740C8A7478788D + .quad 0x03F78121214586B02 # 0.005876608489 6 + .quad 0x03F78121214586B02 + .quad 0x03F7C189CBB0E283F # 0.006859409551 7 + .quad 0x03F7C189CBB0E283F + .quad 0x03F8010157588DE69 # 0.007843177461 8 + .quad 0x03F8010157588DE69 + .quad 0x03F82145E939EF1BC # 0.008827914124 9 + .quad 0x03F82145E939EF1BC + .quad 0x03F83D8896A83D7A8 # 0.009690354884 10 + .quad 0x03F83D8896A83D7A8 + .quad 0x03F85DDC705054DFF # 0.010676913110 11 + .quad 0x03F85DDC705054DFF + .quad 0x03F87E38762CA0C6D # 0.011664445593 12 + .quad 0x03F87E38762CA0C6D + .quad 0x03F89E9CAC6007563 # 0.012652954261 13 + .quad 0x03F89E9CAC6007563 + .quad 0x03F8BF091710935A4 # 0.013642441046 14 + .quad 0x03F8BF091710935A4 + .quad 0x03F8DF7DBA6777895 # 0.014632907884 15 + .quad 0x03F8DF7DBA6777895 + .quad 0x03F8FBEA8B13C03F9 # 
0.015500371846 16 + .quad 0x03F8FBEA8B13C03F9 + .quad 0x03F90E3751F24F45C # 0.016492681528 17 + .quad 0x03F90E3751F24F45C + .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18 + .quad 0x03F91E7D80B1FBF4C + .quad 0x03F92CBE4F6CC56C3 # 0.018355920375 19 + .quad 0x03F92CBE4F6CC56C3 + .quad 0x03F93D0C443D7258C # 0.019351069108 20 + .quad 0x03F93D0C443D7258C + .quad 0x03F94D5E6176ACC89 # 0.020347209148 21 + .quad 0x03F94D5E6176ACC89 + .quad 0x03F95DB4A937DEF10 # 0.021344342472 22 + .quad 0x03F95DB4A937DEF10 + .quad 0x03F96C039490E37F4 # 0.022217650494 23 + .quad 0x03F96C039490E37F4 + .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24 + .quad 0x03F97C61B1CF5DED7 + .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25 + .quad 0x03F98AB77B3FD6EAD + .quad 0x03F99B1D75828E780 # 0.025092472797 26 + .quad 0x03F99B1D75828E780 + .quad 0x03F9AB87A478CB7CB # 0.026094351403 27 + .quad 0x03F9AB87A478CB7CB + .quad 0x03F9B9E8027E1916F # 0.026971819338 28 + .quad 0x03F9B9E8027E1916F + .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29 + .quad 0x03F9CA5A1A18613E6 + .quad 0x03F9D8C1670325921 # 0.028854704473 30 + .quad 0x03F9D8C1670325921 + .quad 0x03F9E93B6EE41F674 # 0.029860361378 31 + .quad 0x03F9E93B6EE41F674 + .quad 0x03F9F7A9B16782855 # 0.030741141554 32 + .quad 0x03F9F7A9B16782855 + .quad 0x03FA0415D89E74440 # 0.031748698315 33 + .quad 0x03FA0415D89E74440 + .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34 + .quad 0x03FA0C58FA19DFAAB + .quad 0x03FA139577CC41C1A # 0.033640607815 35 + .quad 0x03FA139577CC41C1A + .quad 0x03FA1AD398C6CD57C # 0.034524725334 36 + .quad 0x03FA1AD398C6CD57C + .quad 0x03FA231C9C40E204E # 0.035536103423 37 + .quad 0x03FA231C9C40E204E + .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38 + .quad 0x03FA2A5E4231CF7BD + .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39 + .quad 0x03FA32AB4D4C59CB0 + .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40 + .quad 0x03FA39F07BA0EBD5A + .quad 0x03FA424192495D571 # 0.039337907520 41 + .quad 0x03FA424192495D571 + .quad 0x03FA498A4C73DA65D # 0.040227078744 42 + .quad 0x03FA498A4C73DA65D + .quad 0x03FA50D4AF75CA86F # 0.041117041297 43 + .quad 0x03FA50D4AF75CA86F + .quad 0x03FA592BBC15215BC # 0.042135112141 44 + .quad 0x03FA592BBC15215BC + .quad 0x03FA6079B00423FF6 # 0.043026775152 45 + .quad 0x03FA6079B00423FF6 + .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46 + .quad 0x03FA67C94F2D4BB65 + .quad 0x03FA70265A550E77B # 0.044940163069 47 + .quad 0x03FA70265A550E77B + .quad 0x03FA77798F8D6DFDC # 0.045834331871 48 + .quad 0x03FA77798F8D6DFDC + .quad 0x03FA7ECE7267CD123 # 0.046729300926 49 + .quad 0x03FA7ECE7267CD123 + .quad 0x03FA873184BC09586 # 0.047753104446 50 + .quad 0x03FA873184BC09586 + .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51 + .quad 0x03FA8E8A02D2E3175 + .quad 0x03FA95E430F8CE456 # 0.049547286652 52 + .quad 0x03FA95E430F8CE456 + .quad 0x03FA9D400FF482586 # 0.050445586359 53 + .quad 0x03FA9D400FF482586 + .quad 0x03FAA5AB21CB34A9E # 0.051473203662 54 + .quad 0x03FAA5AB21CB34A9E + .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55 + .quad 0x03FAAD0AA2E784EF4 + .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56 + .quad 0x03FAB46BD74DA76A0 + .quad 0x03FABBCEBFC68F424 # 0.054175734102 57 + .quad 0x03FABBCEBFC68F424 + .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58 + .quad 0x03FAC3335D1BBAE4D + .quad 0x03FACBA87200EB8F1 # 0.056110594428 59 + .quad 0x03FACBA87200EB8F1 + .quad 0x03FAD310BA20455A2 # 0.057014812019 60 + .quad 0x03FAD310BA20455A2 + .quad 0x03FADA7AB998B77ED # 0.057919847959 61 + .quad 0x03FADA7AB998B77ED + .quad 0x03FAE1E6713606CFB # 0.058825703731 62 + .quad 0x03FAE1E6713606CFB + .quad 
0x03FAE953E1C48603A # 0.059732380822 63 + .quad 0x03FAE953E1C48603A + .quad 0x03FAF0C30C1116351 # 0.060639880722 64 + .quad 0x03FAF0C30C1116351 + .quad 0x03FAF833F0E927711 # 0.061548204926 65 + .quad 0x03FAF833F0E927711 + .quad 0x03FAFFA6911AB9309 # 0.062457354934 66 + .quad 0x03FAFFA6911AB9309 + .quad 0x03FB038D76BA2D737 # 0.063367332247 67 + .quad 0x03FB038D76BA2D737 + .quad 0x03FB0748836296412 # 0.064278138373 68 + .quad 0x03FB0748836296412 + .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69 + .quad 0x03FB0B046EEE6F7A4 + .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70 + .quad 0x03FB0EC139C5DA5FD + .quad 0x03FB127EE451413A8 # 0.067015544762 71 + .quad 0x03FB127EE451413A8 + .quad 0x03FB163D6EF9579FC # 0.067929681294 72 + .quad 0x03FB163D6EF9579FC + .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73 + .quad 0x03FB19FCDA271ABC0 + .quad 0x03FB1DBD2643D1912 # 0.069760465119 74 + .quad 0x03FB1DBD2643D1912 + .quad 0x03FB217E53B90D3CE # 0.070677115481 75 + .quad 0x03FB217E53B90D3CE + .quad 0x03FB254062F0A9417 # 0.071594606862 76 + .quad 0x03FB254062F0A9417 + .quad 0x03FB29035454CBCB0 # 0.072512940806 77 + .quad 0x03FB29035454CBCB0 + .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78 + .quad 0x03FB2CC7284FE5F1A + .quad 0x03FB308BDF4CB4062 # 0.074352142586 79 + .quad 0x03FB308BDF4CB4062 + .quad 0x03FB345179B63DD3F # 0.075273013532 80 + .quad 0x03FB345179B63DD3F + .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81 + .quad 0x03FB3817F7F7D6EAB + .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82 + .quad 0x03FB3BDF5A7D1EE5E + .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83 + .quad 0x03FB3F1D405CE86D3 + .quad 0x03FB42E64BEC266E4 # 0.078832909176 84 + .quad 0x03FB42E64BEC266E4 + .quad 0x03FB46B03CF437BC4 # 0.079757917501 85 + .quad 0x03FB46B03CF437BC4 + .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86 + .quad 0x03FB4A7B13E1E3E65 + .quad 0x03FB4E46D1223FE84 # 0.081610505036 87 + .quad 0x03FB4E46D1223FE84 + .quad 0x03FB52137522AE732 # 0.082538087426 88 + .quad 0x03FB52137522AE732 + .quad 0x03FB5555DE434F2A0 # 0.083333843436 89 + .quad 0x03FB5555DE434F2A0 + .quad 0x03FB59242FF043D34 # 0.084263026485 90 + .quad 0x03FB59242FF043D34 + .quad 0x03FB5CF36997817B2 # 0.085193073719 91 + .quad 0x03FB5CF36997817B2 + .quad 0x03FB60C38BA799459 # 0.086123986746 92 + .quad 0x03FB60C38BA799459 + .quad 0x03FB6408F471C82A2 # 0.086922602521 93 + .quad 0x03FB6408F471C82A2 + .quad 0x03FB67DAC7466CB96 # 0.087855127734 94 + .quad 0x03FB67DAC7466CB96 + .quad 0x03FB6BAD83C1883BA # 0.088788523361 95 + .quad 0x03FB6BAD83C1883BA + .quad 0x03FB6EF528C056A2D # 0.089589270768 96 + .quad 0x03FB6EF528C056A2D + .quad 0x03FB72C9985035BB1 # 0.090524287199 97 + .quad 0x03FB72C9985035BB1 + .quad 0x03FB769EF2C6B5688 # 0.091460178704 98 + .quad 0x03FB769EF2C6B5688 + .quad 0x03FB79E8D70A364C6 # 0.092263069152 99 + .quad 0x03FB79E8D70A364C6 + .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100 + .quad 0x03FB7DBFE6EA733FE + .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101 + .quad 0x03FB8197E2F40E3F0 + .quad 0x03FB84E40992A4804 # 0.094944035906 102 + .quad 0x03FB84E40992A4804 + .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103 + .quad 0x03FB88BDBD5FC66D2 + .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104 + .quad 0x03FB8C985E9B9EC7E + .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105 + .quad 0x03FB8FE6CAB20E979 + .quad 0x03FB93C3261014C65 # 0.098574780162 106 + .quad 0x03FB93C3261014C65 + .quad 0x03FB97130DC9235DE # 0.099383405543 107 + .quad 0x03FB97130DC9235DE + .quad 0x03FB9AF124D64C623 # 0.100327628989 108 + .quad 0x03FB9AF124D64C623 + .quad 0x03FB9E4289871E964 # 0.101137673586 109 + .quad 
0x03FB9E4289871E964 + .quad 0x03FBA2225DD276FCB # 0.102083555691 110 + .quad 0x03FBA2225DD276FCB + .quad 0x03FBA57540D1FE441 # 0.102895024494 111 + .quad 0x03FBA57540D1FE441 + .quad 0x03FBA956D3ECADE60 # 0.103842571097 112 + .quad 0x03FBA956D3ECADE60 + .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113 + .quad 0x03FBACAB3693AB9C0 + .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114 + .quad 0x03FBB08E8A10F96F4 + .quad 0x03FBB3E46DBA02181 # 0.106419018383 115 + .quad 0x03FBB3E46DBA02181 + .quad 0x03FBB7C9832F58018 # 0.107369911615 116 + .quad 0x03FBB7C9832F58018 + .quad 0x03FBBB20E936D6976 # 0.108185683244 117 + .quad 0x03FBBB20E936D6976 + .quad 0x03FBBF07C23BC54EA # 0.109138258671 118 + .quad 0x03FBBF07C23BC54EA + .quad 0x03FBC260ABFFFE972 # 0.109955474734 119 + .quad 0x03FBC260ABFFFE972 + .quad 0x03FBC6494A2E418A0 # 0.110909738320 120 + .quad 0x03FBC6494A2E418A0 + .quad 0x03FBC9A3B90F57748 # 0.111728403941 121 + .quad 0x03FBC9A3B90F57748 + .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122 + .quad 0x03FBCCFEDBFEE13A8 + .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123 + .quad 0x03FBD0EA1362CDBFC + .quad 0x03FBD446BD753D433 # 0.114325275488 124 + .quad 0x03FBD446BD753D433 + .quad 0x03FBD7A41C8627307 # 0.115146743223 125 + .quad 0x03FBD7A41C8627307 + .quad 0x03FBDB91F09680DF9 # 0.116105975911 126 + .quad 0x03FBDB91F09680DF9 + .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127 + .quad 0x03FBDEF0D8D466DBB + .quad 0x03FBE2507702AF03B # 0.117752518544 128 + .quad 0x03FBE2507702AF03B + .quad 0x03FBE640EB3D2B411 # 0.118714255240 129 + .quad 0x03FBE640EB3D2B411 + .quad 0x03FBE9A214A69DD58 # 0.119539337795 130 + .quad 0x03FBE9A214A69DD58 + .quad 0x03FBED03F4F440969 # 0.120365101673 131 + .quad 0x03FBED03F4F440969 + .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132 + .quad 0x03FBF0F70CDD992E4 + .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133 + .quad 0x03FBF45A7A78B7C3B + .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134 + .quad 0x03FBF7BE9FEDBFDED + .quad 0x03FBFB237D8AB13FB # 0.123813143156 135 + .quad 0x03FBFB237D8AB13FB + .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136 + .quad 0x03FBFF1A13EAC95FD + .quad 0x03FC014040CAB0229 # 0.125610834299 137 + .quad 0x03FC014040CAB0229 + .quad 0x03FC02F3D4301417B # 0.126441629140 138 + .quad 0x03FC02F3D4301417B + .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139 + .quad 0x03FC04A7C44CF87A4 + .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140 + .quad 0x03FC06A4D1D26C5E9 + .quad 0x03FC08598B59E3A07 # 0.129077042275 141 + .quad 0x03FC08598B59E3A07 + .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142 + .quad 0x03FC0A0EA2164AF02 + .quad 0x03FC0BC4162F73B66 # 0.130745099376 143 + .quad 0x03FC0BC4162F73B66 + .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144 + .quad 0x03FC0D79E7CD48E58 + .quad 0x03FC0F301717CF0FB # 0.132415943541 145 + .quad 0x03FC0F301717CF0FB + .quad 0x03FC10E6A437247B7 # 0.133252413686 146 + .quad 0x03FC10E6A437247B7 + .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147 + .quad 0x03FC12E6BFA8FEAD6 + .quad 0x03FC149E189F8642E # 0.135067169541 148 + .quad 0x03FC149E189F8642E + .quad 0x03FC1655CFEA923A4 # 0.135905861231 149 + .quad 0x03FC1655CFEA923A4 + .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150 + .quad 0x03FC180DE5B2ACE5C + .quad 0x03FC19C65A207AC07 # 0.137585357777 151 + .quad 0x03FC19C65A207AC07 + .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152 + .quad 0x03FC1B7F2D5CBA842 + .quad 0x03FC1D385F90453F2 # 0.139267679777 153 + .quad 0x03FC1D385F90453F2 + .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154 + .quad 0x03FC1EF1F0E40E6CD + .quad 0x03FC20ABE18124098 # 0.140952836755 155 + .quad 
0x03FC20ABE18124098 + .quad 0x03FC22663190AEACC # 0.141796481350 156 + .quad 0x03FC22663190AEACC + .quad 0x03FC2420E13BF19E3 # 0.142640838281 157 + .quad 0x03FC2420E13BF19E3 + .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158 + .quad 0x03FC25DBF0AC4AED2 + .quad 0x03FC2797600B3387B # 0.144331693975 159 + .quad 0x03FC2797600B3387B + .quad 0x03FC29532F823F525 # 0.145178195155 160 + .quad 0x03FC29532F823F525 + .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161 + .quad 0x03FC2B0F5F3B1D3EF + .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162 + .quad 0x03FC2CCBEF5F97653 + .quad 0x03FC2E88E01993187 # 0.147722006588 163 + .quad 0x03FC2E88E01993187 + .quad 0x03FC3046319311009 # 0.148571383763 164 + .quad 0x03FC3046319311009 + .quad 0x03FC3203E3F62D328 # 0.149421482992 165 + .quad 0x03FC3203E3F62D328 + .quad 0x03FC33C1F76D1F469 # 0.150272305505 166 + .quad 0x03FC33C1F76D1F469 + .quad 0x03FC35806C223A70F # 0.151123852534 167 + .quad 0x03FC35806C223A70F + .quad 0x03FC373F423FED9A1 # 0.151976125313 168 + .quad 0x03FC373F423FED9A1 + .quad 0x03FC38FE79F0C3771 # 0.152829125080 169 + .quad 0x03FC38FE79F0C3771 + .quad 0x03FC3ABE135F62A12 # 0.153682853077 170 + .quad 0x03FC3ABE135F62A12 + .quad 0x03FC3C335E0447D71 # 0.154394850259 171 + .quad 0x03FC3C335E0447D71 + .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172 + .quad 0x03FC3DF3AB13505F9 + .quad 0x03FC3FB45A59928CA # 0.156105714663 173 + .quad 0x03FC3FB45A59928CA + .quad 0x03FC41756C0220C81 # 0.156962245765 174 + .quad 0x03FC41756C0220C81 + .quad 0x03FC4336E03829D61 # 0.157819511141 175 + .quad 0x03FC4336E03829D61 + .quad 0x03FC44F8B726F8EFE # 0.158677512051 176 + .quad 0x03FC44F8B726F8EFE + .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177 + .quad 0x03FC46BAF0F9F5DB8 + .quad 0x03FC48326CD3EC797 # 0.160252428262 178 + .quad 0x03FC48326CD3EC797 + .quad 0x03FC49F55C6502F81 # 0.161112520058 179 + .quad 0x03FC49F55C6502F81 + .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180 + .quad 0x03FC4BB8AF55DE908 + .quad 0x03FC4D7C65D25566D # 0.162834926111 181 + .quad 0x03FC4D7C65D25566D + .quad 0x03FC4F4080065AA7F # 0.163697242922 182 + .quad 0x03FC4F4080065AA7F + .quad 0x03FC50B98CD30A759 # 0.164416408720 183 + .quad 0x03FC50B98CD30A759 + .quad 0x03FC527E5E4A1B58D # 0.165280090939 184 + .quad 0x03FC527E5E4A1B58D + .quad 0x03FC544393F5DF80F # 0.166144519750 185 + .quad 0x03FC544393F5DF80F + .quad 0x03FC56092E02BA514 # 0.167009696444 186 + .quad 0x03FC56092E02BA514 + .quad 0x03FC57837B3098F2C # 0.167731249257 187 + .quad 0x03FC57837B3098F2C + .quad 0x03FC5949CDB873419 # 0.168597800437 188 + .quad 0x03FC5949CDB873419 + .quad 0x03FC5B10851FC924A # 0.169465103180 189 + .quad 0x03FC5B10851FC924A + .quad 0x03FC5C8BC079D8289 # 0.170188430518 190 + .quad 0x03FC5C8BC079D8289 + .quad 0x03FC5E533144C1718 # 0.171057114516 191 + .quad 0x03FC5E533144C1718 + .quad 0x03FC601B076E7A8A8 # 0.171926553783 192 + .quad 0x03FC601B076E7A8A8 + .quad 0x03FC619732215D786 # 0.172651664394 193 + .quad 0x03FC619732215D786 + .quad 0x03FC635FC298F6C77 # 0.173522491735 194 + .quad 0x03FC635FC298F6C77 + .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195 + .quad 0x03FC6528B8EFA5D16 + .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196 + .quad 0x03FC66A5D42A3AD33 + .quad 0x03FC686F85BAD4298 # 0.175993962063 197 + .quad 0x03FC686F85BAD4298 + .quad 0x03FC6A399DABBD383 # 0.176867706111 198 + .quad 0x03FC6A399DABBD383 + .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199 + .quad 0x03FC6BB7AA9F22C40 + .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200 + .quad 0x03FC6D827EB7C1E57 + .quad 0x03FC6F0128B756AB9 # 0.179201429458 201 + .quad 
0x03FC6F0128B756AB9 + .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202 + .quad 0x03FC70CCB9927BCF6 + .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203 + .quad 0x03FC7298B1A4E32B6 + .quad 0x03FC74184F58CC7DC # 0.181686992547 204 + .quad 0x03FC74184F58CC7DC + .quad 0x03FC75E5051E74141 # 0.182565727226 205 + .quad 0x03FC75E5051E74141 + .quad 0x03FC77654128F6127 # 0.183298596442 206 + .quad 0x03FC77654128F6127 + .quad 0x03FC7932B53E97639 # 0.184178749058 207 + .quad 0x03FC7932B53E97639 + .quad 0x03FC7AB390229D8FD # 0.184912801796 208 + .quad 0x03FC7AB390229D8FD + .quad 0x03FC7C81C325B4A5E # 0.185794376934 209 + .quad 0x03FC7C81C325B4A5E + .quad 0x03FC7E033D66CD24A # 0.186529617023 210 + .quad 0x03FC7E033D66CD24A + .quad 0x03FC7FD22FF599D4C # 0.187412619288 211 + .quad 0x03FC7FD22FF599D4C + .quad 0x03FC81544A17F67C1 # 0.188149050576 212 + .quad 0x03FC81544A17F67C1 + .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213 + .quad 0x03FC8323FCD17DAC8 + .quad 0x03FC84A6B759F512D # 0.189771110947 214 + .quad 0x03FC84A6B759F512D + .quad 0x03FC86772ADE0201C # 0.190656981373 215 + .quad 0x03FC86772ADE0201C + .quad 0x03FC87FA865210911 # 0.191395806674 216 + .quad 0x03FC87FA865210911 + .quad 0x03FC89CBBB4136201 # 0.192283118179 217 + .quad 0x03FC89CBBB4136201 + .quad 0x03FC8B4FB826FF291 # 0.193023146334 218 + .quad 0x03FC8B4FB826FF291 + .quad 0x03FC8D21AF2299298 # 0.193911903613 219 + .quad 0x03FC8D21AF2299298 + .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220 + .quad 0x03FC8EA64E00E7FC0 + .quad 0x03FC902B36AB7681D # 0.195394923313 221 + .quad 0x03FC902B36AB7681D + .quad 0x03FC91FE49096581E # 0.196285791969 222 + .quad 0x03FC91FE49096581E + .quad 0x03FC9383D471B869B # 0.197028789254 223 + .quad 0x03FC9383D471B869B + .quad 0x03FC9557AA6B87F65 # 0.197921115309 224 + .quad 0x03FC9557AA6B87F65 + .quad 0x03FC96DDD91A0B959 # 0.198665329082 225 + .quad 0x03FC96DDD91A0B959 + .quad 0x03FC9864522D04491 # 0.199410097121 226 + .quad 0x03FC9864522D04491 + .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227 + .quad 0x03FC9A3945D1A44B3 + .quad 0x03FC9BC062F26FC3B # 0.201050541900 228 + .quad 0x03FC9BC062F26FC3B + .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229 + .quad 0x03FC9D47CAD2C1871 + .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230 + .quad 0x03FC9F1DDD7FE4F8B + .quad 0x03FCA0A5EA371A910 # 0.203441457564 231 + .quad 0x03FCA0A5EA371A910 + .quad 0x03FCA22E42098F498 # 0.204189792554 232 + .quad 0x03FCA22E42098F498 + .quad 0x03FCA405751F6CCE4 # 0.205088534376 233 + .quad 0x03FCA405751F6CCE4 + .quad 0x03FCA58E729348F40 # 0.205838103409 234 + .quad 0x03FCA58E729348F40 + .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235 + .quad 0x03FCA717BB7EC64A3 + .quad 0x03FCA8F010601E5FD # 0.207489135679 236 + .quad 0x03FCA8F010601E5FD + .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237 + .quad 0x03FCAA79FFB8FCD48 + .quad 0x03FCAC043AE68965A # 0.208992443238 238 + .quad 0x03FCAC043AE68965A + .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239 + .quad 0x03FCAD8EC205FB6AD + .quad 0x03FCAF6895610DBAD # 0.210648695969 240 + .quad 0x03FCAF6895610DBAD + .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241 + .quad 0x03FCB0F3C3FBD65C9 + .quad 0x03FCB27F3EE674219 # 0.212156764419 242 + .quad 0x03FCB27F3EE674219 + .quad 0x03FCB40B063E65B0F # 0.212911652354 243 + .quad 0x03FCB40B063E65B0F + .quad 0x03FCB5E65A8096C88 # 0.213818270730 244 + .quad 0x03FCB5E65A8096C88 + .quad 0x03FCB772CA646760C # 0.214574414434 245 + .quad 0x03FCB772CA646760C + .quad 0x03FCB8FF871461198 # 0.215331130323 246 + .quad 0x03FCB8FF871461198 + .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247 + .quad 
0x03FCBA8C90AE4AD19 + .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248 + .quad 0x03FCBC19E74FFCBDA + .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249 + .quad 0x03FCBDF71B83DAE7A + .quad 0x03FCBF851C067555C # 0.218515604922 250 + .quad 0x03FCBF851C067555C + .quad 0x03FCC11369F0CDB3C # 0.219275310193 251 + .quad 0x03FCC11369F0CDB3C + .quad 0x03FCC2A205610593E # 0.220035593055 252 + .quad 0x03FCC2A205610593E + .quad 0x03FCC430EE755023B # 0.220796454387 253 + .quad 0x03FCC430EE755023B + .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254 + .quad 0x03FCC5C0254BF23A8 + .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255 + .quad 0x03FCC79F9AB632BF1 + .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256 + .quad 0x03FCC92F7D09ABE20 + .quad 0x03FCCABFAD80D023D # 0.223998408788 257 + .quad 0x03FCCABFAD80D023D + .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258 + .quad 0x03FCCC502C3A2F1E8 + .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259 + .quad 0x03FCCDE0F9546A5E7 + .quad 0x03FCCF7214EE356E9 # 0.226291812439 260 + .quad 0x03FCCF7214EE356E9 + .quad 0x03FCD1037F2655E7B # 0.227057450635 261 + .quad 0x03FCD1037F2655E7B + .quad 0x03FCD295381BA37E9 # 0.227823675483 262 + .quad 0x03FCD295381BA37E9 + .quad 0x03FCD4273FED08111 # 0.228590487882 263 + .quad 0x03FCD4273FED08111 + .quad 0x03FCD5B996B97FB5F # 0.229357888733 264 + .quad 0x03FCD5B996B97FB5F + .quad 0x03FCD74C3CA018C9C # 0.230125878940 265 + .quad 0x03FCD74C3CA018C9C + .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266 + .quad 0x03FCD8DF31BFF3FF2 + .quad 0x03FCDA727638446A1 # 0.231663631050 267 + .quad 0x03FCDA727638446A1 + .quad 0x03FCDC56CAE452F5B # 0.232587418645 268 + .quad 0x03FCDC56CAE452F5B + .quad 0x03FCDDEABE5A3926E # 0.233357894066 269 + .quad 0x03FCDDEABE5A3926E + .quad 0x03FCDF7F018CE771F # 0.234128963578 270 + .quad 0x03FCDF7F018CE771F + .quad 0x03FCE113949BDEC62 # 0.234900628096 271 + .quad 0x03FCE113949BDEC62 + .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272 + .quad 0x03FCE2A877A6B2C0F + .quad 0x03FCE43DAACD09BEC # 0.236445745833 273 + .quad 0x03FCE43DAACD09BEC + .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274 + .quad 0x03FCE5D32E2E9CE87 + .quad 0x03FCE76901EB38427 # 0.237993254653 275 + .quad 0x03FCE76901EB38427 + .quad 0x03FCE8ADE53F76866 # 0.238612929343 276 + .quad 0x03FCE8ADE53F76866 + .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277 + .quad 0x03FCEA4449F04AAF4 + .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278 + .quad 0x03FCEBDAFF5593E99 + .quad 0x03FCED72058F666C5 # 0.240940135421 279 + .quad 0x03FCED72058F666C5 + .quad 0x03FCEF095CBDE9937 # 0.241717075868 280 + .quad 0x03FCEF095CBDE9937 + .quad 0x03FCF0A1050157ED6 # 0.242494620422 281 + .quad 0x03FCF0A1050157ED6 + .quad 0x03FCF238FE79FF4BF # 0.243272770021 282 + .quad 0x03FCF238FE79FF4BF + .quad 0x03FCF3D1494840D2F # 0.244051525609 283 + .quad 0x03FCF3D1494840D2F + .quad 0x03FCF569E58C91077 # 0.244830888130 284 + .quad 0x03FCF569E58C91077 + .quad 0x03FCF702D36777DF0 # 0.245610858531 285 + .quad 0x03FCF702D36777DF0 + .quad 0x03FCF89C12F990D0C # 0.246391437760 286 + .quad 0x03FCF89C12F990D0C + .quad 0x03FCFA35A4638AE2C # 0.247172626770 287 + .quad 0x03FCFA35A4638AE2C + .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288 + .quad 0x03FCFB7D86EEE3B92 + .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289 + .quad 0x03FCFD17ABFCDB683 + .quad 0x03FCFEB2233EA07CB # 0.249363208150 290 + .quad 0x03FCFEB2233EA07CB + .quad 0x03FD0026766A9671C # 0.250146723037 291 + .quad 0x03FD0026766A9671C + .quad 0x03FD00F40470C7323 # 0.250930852302 292 + .quad 0x03FD00F40470C7323 + .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293 + .quad 
0x03FD01C1BBC2735A3 + .quad 0x03FD028F9C7035C1D # 0.252500957822 294 + .quad 0x03FD028F9C7035C1D + .quad 0x03FD03346E0106062 # 0.253129690945 295 + .quad 0x03FD03346E0106062 + .quad 0x03FD0402994B4F041 # 0.253916163656 296 + .quad 0x03FD0402994B4F041 + .quad 0x03FD04D0EE20620AF # 0.254703255393 297 + .quad 0x03FD04D0EE20620AF + .quad 0x03FD059F6C910034D # 0.255490967131 298 + .quad 0x03FD059F6C910034D + .quad 0x03FD066E14ADF4BFD # 0.256279299848 299 + .quad 0x03FD066E14ADF4BFD + .quad 0x03FD07138604D5864 # 0.256910413785 300 + .quad 0x03FD07138604D5864 + .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301 + .quad 0x03FD07E2794F3E8C1 + .quad 0x03FD08B196753A125 # 0.258489943414 302 + .quad 0x03FD08B196753A125 + .quad 0x03FD0980DD87BA2DD # 0.259280644807 303 + .quad 0x03FD0980DD87BA2DD + .quad 0x03FD0A504E97BB40C # 0.260071971904 304 + .quad 0x03FD0A504E97BB40C + .quad 0x03FD0AF660EB9E278 # 0.260705484754 305 + .quad 0x03FD0AF660EB9E278 + .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306 + .quad 0x03FD0BC61DBBA97CB + .quad 0x03FD0C9604B8FC51E # 0.262291024962 307 + .quad 0x03FD0C9604B8FC51E + .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308 + .quad 0x03FD0D3C7586CD5E5 + .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309 + .quad 0x03FD0E0CA89A72D29 + .quad 0x03FD0EDD060B78082 # 0.264515013170 310 + .quad 0x03FD0EDD060B78082 + .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311 + .quad 0x03FD0FAD8DEB1E2C0 + .quad 0x03FD10547F9D26ABC # 0.265947336165 312 + .quad 0x03FD10547F9D26ABC + .quad 0x03FD1125540925114 # 0.266743958529 313 + .quad 0x03FD1125540925114 + .quad 0x03FD11F653144CB8B # 0.267541216005 314 + .quad 0x03FD11F653144CB8B + .quad 0x03FD129DA43F5BE9E # 0.268179479949 315 + .quad 0x03FD129DA43F5BE9E + .quad 0x03FD136EF02E8290C # 0.268977883185 316 + .quad 0x03FD136EF02E8290C + .quad 0x03FD144066EDAE406 # 0.269776924378 317 + .quad 0x03FD144066EDAE406 + .quad 0x03FD14E817FF359D7 # 0.270416617347 318 + .quad 0x03FD14E817FF359D7 + .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319 + .quad 0x03FD15B9DBFA9DEC8 + .quad 0x03FD168BCAF73B3EB # 0.272017642345 320 + .quad 0x03FD168BCAF73B3EB + .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321 + .quad 0x03FD1733DC5D68DE8 + .quad 0x03FD180618EF18ADE # 0.273460759729 322 + .quad 0x03FD180618EF18ADE + .quad 0x03FD18D880B3826FE # 0.274263392407 323 + .quad 0x03FD18D880B3826FE + .quad 0x03FD1980F2DD42B6F # 0.274905962710 324 + .quad 0x03FD1980F2DD42B6F + .quad 0x03FD1A53A8902E70B # 0.275709756661 325 + .quad 0x03FD1A53A8902E70B + .quad 0x03FD1AFC59297024D # 0.276353257326 326 + .quad 0x03FD1AFC59297024D + .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327 + .quad 0x03FD1BCF5D04AE1EA + .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328 + .quad 0x03FD1CA28C64BAE54 + .quad 0x03FD1D4B9E796C245 # 0.278608776246 329 + .quad 0x03FD1D4B9E796C245 + .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330 + .quad 0x03FD1E1F1C5C3A06C + .quad 0x03FD1EC86D5747AAD # 0.280061443760 331 + .quad 0x03FD1EC86D5747AAD + .quad 0x03FD1F9C39F74C559 # 0.280869394034 332 + .quad 0x03FD1F9C39F74C559 + .quad 0x03FD2070326F1F789 # 0.281677997620 333 + .quad 0x03FD2070326F1F789 + .quad 0x03FD2119E59F8789C # 0.282325351583 334 + .quad 0x03FD2119E59F8789C + .quad 0x03FD21EE2D300381C # 0.283135133796 335 + .quad 0x03FD21EE2D300381C + .quad 0x03FD22981FBEF797A # 0.283783432036 336 + .quad 0x03FD22981FBEF797A + .quad 0x03FD236CB6A339EED # 0.284594396317 337 + .quad 0x03FD236CB6A339EED + .quad 0x03FD2416E8C01F606 # 0.285243641592 338 + .quad 0x03FD2416E8C01F606 + .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339 + .quad 
0x03FD24EBCF3387FF6 + .quad 0x03FD2596410DF963A # 0.286705986479 340 + .quad 0x03FD2596410DF963A + .quad 0x03FD266B774C2AF55 # 0.287519325279 341 + .quad 0x03FD266B774C2AF55 + .quad 0x03FD27162913F873F # 0.288170472950 342 + .quad 0x03FD27162913F873F + .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343 + .quad 0x03FD27EBAF58D8C9C + .quad 0x03FD2896A13E086A3 # 0.289637107288 344 + .quad 0x03FD2896A13E086A3 + .quad 0x03FD296C77C5C0E13 # 0.290452834554 345 + .quad 0x03FD296C77C5C0E13 + .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346 + .quad 0x03FD2A17A9F88EDD2 + .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347 + .quad 0x03FD2AEDD0FF8CC2C + .quad 0x03FD2B9943B06BD77 # 0.292576844829 348 + .quad 0x03FD2B9943B06BD77 + .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349 + .quad 0x03FD2C6FBB7360D0E + .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350 + .quad 0x03FD2D1B6ED2FA90C + .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351 + .quad 0x03FD2DC73F01B0DD4 + .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352 + .quad 0x03FD2E9E2BCE12286 + .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353 + .quad 0x03FD2F4A3CF22EDC2 + .quad 0x03FD30217B1006601 # 0.297002718785 354 + .quad 0x03FD30217B1006601 + .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355 + .quad 0x03FD30CDCD5ABA762 + .quad 0x03FD31A55D07A8590 # 0.298482373803 356 + .quad 0x03FD31A55D07A8590 + .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357 + .quad 0x03FD3251F0AA5CC1A + .quad 0x03FD32FEA167A6D70 # 0.299799463226 358 + .quad 0x03FD32FEA167A6D70 + .quad 0x03FD33D6A7509D491 # 0.300623525901 359 + .quad 0x03FD33D6A7509D491 + .quad 0x03FD348399ADA9D94 # 0.301283265328 360 + .quad 0x03FD348399ADA9D94 + .quad 0x03FD3530A9454ADC9 # 0.301943440298 361 + .quad 0x03FD3530A9454ADC9 + .quad 0x03FD360925EC44F5C # 0.302769272371 362 + .quad 0x03FD360925EC44F5C + .quad 0x03FD36B6776BE1116 # 0.303430429420 363 + .quad 0x03FD36B6776BE1116 + .quad 0x03FD378F469437FB4 # 0.304257490918 364 + .quad 0x03FD378F469437FB4 + .quad 0x03FD383CDA2E14ECB # 0.304919632971 365 + .quad 0x03FD383CDA2E14ECB + .quad 0x03FD38EA8B3924521 # 0.305582213748 366 + .quad 0x03FD38EA8B3924521 + .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367 + .quad 0x03FD39C3D1FD60E74 + .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368 + .quad 0x03FD3A71C56BB48C7 + .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369 + .quad 0x03FD3B1FD66BC8D10 + .quad 0x03FD3BF995502CB5C # 0.308569272059 370 + .quad 0x03FD3BF995502CB5C + .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371 + .quad 0x03FD3CA7E8FD01DF6 + .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372 + .quad 0x03FD3D565A5C5BF11 + .quad 0x03FD3E3091E6049FB # 0.310732154526 373 + .quad 0x03FD3E3091E6049FB + .quad 0x03FD3EDF463C1683E # 0.311398599069 374 + .quad 0x03FD3EDF463C1683E + .quad 0x03FD3F8E1865A82DD # 0.312065488057 375 + .quad 0x03FD3F8E1865A82DD + .quad 0x03FD403D086CEA79B # 0.312732822082 376 + .quad 0x03FD403D086CEA79B + .quad 0x03FD4117DE854CA15 # 0.313567616354 377 + .quad 0x03FD4117DE854CA15 + .quad 0x03FD41C711E4BA15E # 0.314235953889 378 + .quad 0x03FD41C711E4BA15E + .quad 0x03FD427663431B221 # 0.314904738398 379 + .quad 0x03FD427663431B221 + .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380 + .quad 0x03FD4325D2AAB6F18 + .quad 0x03FD44014838E5513 # 0.316411140893 381 + .quad 0x03FD44014838E5513 + .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382 + .quad 0x03FD44B0FB5AF4F44 + .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383 + .quad 0x03FD4560CCA7CB3B2 + .quad 0x03FD4610BC29C5E18 # 0.318423214006 384 + .quad 0x03FD4610BC29C5E18 + .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385 + .quad 
0x03FD46ECD216CDCB5 + .quad 0x03FD479D05B65CB60 # 0.319934930091 386 + .quad 0x03FD479D05B65CB60 + .quad 0x03FD484D57ACE5A1A # 0.320607538154 387 + .quad 0x03FD484D57ACE5A1A + .quad 0x03FD48FDC804DD1CB # 0.321280598924 388 + .quad 0x03FD48FDC804DD1CB + .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389 + .quad 0x03FD49DA7F3BCC420 + .quad 0x03FD4A8B341552B09 # 0.322796644021 390 + .quad 0x03FD4A8B341552B09 + .quad 0x03FD4B3C077267E9A # 0.323471180303 391 + .quad 0x03FD4B3C077267E9A + .quad 0x03FD4BECF95D97914 # 0.324146171892 392 + .quad 0x03FD4BECF95D97914 + .quad 0x03FD4C9E09E172C3D # 0.324821619401 393 + .quad 0x03FD4C9E09E172C3D + .quad 0x03FD4D4F3908901A0 # 0.325497523449 394 + .quad 0x03FD4D4F3908901A0 + .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395 + .quad 0x03FD4E2CDF1F341C1 + .quad 0x03FD4EDE535C79642 # 0.327019979972 396 + .quad 0x03FD4EDE535C79642 + .quad 0x03FD4F8FE65F90500 # 0.327697372039 397 + .quad 0x03FD4F8FE65F90500 + .quad 0x03FD5041983326F2D # 0.328375223276 398 + .quad 0x03FD5041983326F2D + .quad 0x03FD50F368E1F0F02 # 0.329053534308 399 + .quad 0x03FD50F368E1F0F02 + .quad 0x03FD51A55876A77F5 # 0.329732305758 400 + .quad 0x03FD51A55876A77F5 + .quad 0x03FD5283EF743F98B # 0.330581418486 401 + .quad 0x03FD5283EF743F98B + .quad 0x03FD533624B59CA35 # 0.331261228165 402 + .quad 0x03FD533624B59CA35 + .quad 0x03FD53E878FFE6EAE # 0.331941500300 403 + .quad 0x03FD53E878FFE6EAE + .quad 0x03FD549AEC5DEF880 # 0.332622235521 404 + .quad 0x03FD549AEC5DEF880 + .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405 + .quad 0x03FD554D7EDA8D3C4 + .quad 0x03FD560030809C759 # 0.333985097742 406 + .quad 0x03FD560030809C759 + .quad 0x03FD56B3015AFF52C # 0.334667226008 407 + .quad 0x03FD56B3015AFF52C + .quad 0x03FD5765F1749DA6C # 0.335349819892 408 + .quad 0x03FD5765F1749DA6C + .quad 0x03FD581900D864FD7 # 0.336032880027 409 + .quad 0x03FD581900D864FD7 + .quad 0x03FD58CC2F91489F5 # 0.336716407053 410 + .quad 0x03FD58CC2F91489F5 + .quad 0x03FD59AC5618CCE38 # 0.337571473373 411 + .quad 0x03FD59AC5618CCE38 + .quad 0x03FD5A5FCB795780C # 0.338256053239 412 + .quad 0x03FD5A5FCB795780C + .quad 0x03FD5B136052BCE39 # 0.338941102075 413 + .quad 0x03FD5B136052BCE39 + .quad 0x03FD5BC714B008E23 # 0.339626620526 414 + .quad 0x03FD5BC714B008E23 + .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415 + .quad 0x03FD5C7AE89C4D254 + .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416 + .quad 0x03FD5D2EDC22A12BA + .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417 + .quad 0x03FD5DE2EF4E224D6 + .quad 0x03FD5E972229F3C15 # 0.342373403369 418 + .quad 0x03FD5E972229F3C15 + .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419 + .quad 0x03FD5F4B74C13EA04 + .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420 + .quad 0x03FD5FFFE71F31E9A + .quad 0x03FD60B4794F02875 # 0.344438453147 421 + .quad 0x03FD60B4794F02875 + .quad 0x03FD61692B5BEB520 # 0.345127751813 422 + .quad 0x03FD61692B5BEB520 + .quad 0x03FD621DFD512D14F # 0.345817525940 423 + .quad 0x03FD621DFD512D14F + .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424 + .quad 0x03FD62D2EF3A0E933 + .quad 0x03FD63880121DC8AB # 0.347198503200 425 + .quad 0x03FD63880121DC8AB + .quad 0x03FD643D3313E9B92 # 0.347889707652 426 + .quad 0x03FD643D3313E9B92 + .quad 0x03FD64F2851B8EE01 # 0.348581390197 427 + .quad 0x03FD64F2851B8EE01 + .quad 0x03FD65A7F7442AC90 # 0.349273551498 428 + .quad 0x03FD65A7F7442AC90 + .quad 0x03FD665D8999224A5 # 0.349966192218 429 + .quad 0x03FD665D8999224A5 + .quad 0x03FD67133C25E04A5 # 0.350659313022 430 + .quad 0x03FD67133C25E04A5 + .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431 + .quad 
0x03FD67C90EF5D5C4C + .quad 0x03FD687F021479CEE # 0.352046997547 432 + .quad 0x03FD687F021479CEE + .quad 0x03FD6935158D499B3 # 0.352741562603 433 + .quad 0x03FD6935158D499B3 + .quad 0x03FD69EB496BC87E5 # 0.353436610416 434 + .quad 0x03FD69EB496BC87E5 + .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435 + .quad 0x03FD6AA19DBB7FF34 + .quad 0x03FD6B581287FF9FD # 0.354828156996 436 + .quad 0x03FD6B581287FF9FD + .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437 + .quad 0x03FD6C0EA7DCDD591 + .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438 + .quad 0x03FD6C97AD3CFCFD9 + .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439 + .quad 0x03FD6D4E7B9C727EC + .quad 0x03FD6E056AA4421D6 # 0.357442537571 440 + .quad 0x03FD6E056AA4421D6 + .quad 0x03FD6EBC7A6019066 # 0.358140861621 441 + .quad 0x03FD6EBC7A6019066 + .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442 + .quad 0x03FD6F73AADBAAAB7 + .quad 0x03FD702AFC22B0C6D # 0.359538974397 443 + .quad 0x03FD702AFC22B0C6D + .quad 0x03FD70E26E40EB5FA # 0.360238764489 444 + .quad 0x03FD70E26E40EB5FA + .quad 0x03FD719A014220CF5 # 0.360939044629 445 + .quad 0x03FD719A014220CF5 + .quad 0x03FD7251B5321DC54 # 0.361639815506 446 + .quad 0x03FD7251B5321DC54 + .quad 0x03FD73098A1CB54BA # 0.362341077807 447 + .quad 0x03FD73098A1CB54BA + .quad 0x03FD73937F783CEBA # 0.362867347444 448 + .quad 0x03FD73937F783CEBA + .quad 0x03FD744B8E35E9EDA # 0.363569471398 449 + .quad 0x03FD744B8E35E9EDA + .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450 + .quad 0x03FD7503BE0ED6C66 + .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451 + .quad 0x03FD75BC0F0EEE7DE + .quad 0x03FD76748142228C7 # 0.365678805982 452 + .quad 0x03FD76748142228C7 + .quad 0x03FD772D14B46AE00 # 0.366382907402 453 + .quad 0x03FD772D14B46AE00 + .quad 0x03FD77E5C971C5E06 # 0.367087504930 454 + .quad 0x03FD77E5C971C5E06 + .quad 0x03FD787066E04915F # 0.367616279067 455 + .quad 0x03FD787066E04915F + .quad 0x03FD792955FDF47A3 # 0.368321746469 456 + .quad 0x03FD792955FDF47A3 + .quad 0x03FD79E26687CFB3D # 0.369027711906 457 + .quad 0x03FD79E26687CFB3D + .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458 + .quad 0x03FD7A9B9889F19E2 + .quad 0x03FD7B54EC1077A48 # 0.370441139703 459 + .quad 0x03FD7B54EC1077A48 + .quad 0x03FD7C0E612785C74 # 0.371148603475 460 + .quad 0x03FD7C0E612785C74 + .quad 0x03FD7C998F06FB152 # 0.371679529954 461 + .quad 0x03FD7C998F06FB152 + .quad 0x03FD7D533EF841E8A # 0.372387870696 462 + .quad 0x03FD7D533EF841E8A + .quad 0x03FD7E0D109B95F19 # 0.373096713539 463 + .quad 0x03FD7E0D109B95F19 + .quad 0x03FD7EC703FD340AA # 0.373806059198 464 + .quad 0x03FD7EC703FD340AA + .quad 0x03FD7F8119295FB9B # 0.374515908385 465 + .quad 0x03FD7F8119295FB9B + .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466 + .quad 0x03FD800CBF3ED1CC2 + .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467 + .quad 0x03FD80C70FAB0BDF6 + .quad 0x03FD81818203AFC7F # 0.376470595813 468 + .quad 0x03FD81818203AFC7F + .quad 0x03FD823C16551A3C3 # 0.377182339615 469 + .quad 0x03FD823C16551A3C3 + .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470 + .quad 0x03FD82C81BE4DFF4A + .quad 0x03FD8382EBC7794D1 # 0.378429111528 471 + .quad 0x03FD8382EBC7794D1 + .quad 0x03FD843DDDC4FB137 # 0.379142251156 472 + .quad 0x03FD843DDDC4FB137 + .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473 + .quad 0x03FD84F8F1E9DB72B + .quad 0x03FD85855776DCBFB # 0.380391470556 474 + .quad 0x03FD85855776DCBFB + .quad 0x03FD8640A77EB3957 # 0.381106011494 475 + .quad 0x03FD8640A77EB3957 + .quad 0x03FD86FC19D05148E # 0.381821063366 476 + .quad 0x03FD86FC19D05148E + .quad 0x03FD87B7AE7845C0F # 0.382536626902 477 + .quad 
0x03FD87B7AE7845C0F + .quad 0x03FD8844748678822 # 0.383073635776 478 + .quad 0x03FD8844748678822 + .quad 0x03FD89004563D3DFD # 0.383790096491 479 + .quad 0x03FD89004563D3DFD + .quad 0x03FD89BC38BA356B4 # 0.384507070890 480 + .quad 0x03FD89BC38BA356B4 + .quad 0x03FD8A4945E20894E # 0.385045139237 481 + .quad 0x03FD8A4945E20894E + .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482 + .quad 0x03FD8B0575AAB1FC5 + .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483 + .quad 0x03FD8BC1C80F45A32 + .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484 + .quad 0x03FD8C7E3D1C80B2F + .quad 0x03FD8D0BABACC89EE # 0.387739832326 485 + .quad 0x03FD8D0BABACC89EE + .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486 + .quad 0x03FD8DC85D7FE5013 + .quad 0x03FD8E85321ED5598 # 0.389179976589 487 + .quad 0x03FD8E85321ED5598 + .quad 0x03FD8F12E873862C7 # 0.389720565845 488 + .quad 0x03FD8F12E873862C7 + .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489 + .quad 0x03FD8FCFFA1614AA0 + .quad 0x03FD908D2EA7D9511 # 0.391163567538 490 + .quad 0x03FD908D2EA7D9511 + .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491 + .quad 0x03FD911B2D09ED9D6 + .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492 + .quad 0x03FD91D89EDD6B7FF + .quad 0x03FD929633C3B7D3E # 0.393151100941 493 + .quad 0x03FD929633C3B7D3E + .quad 0x03FD93247A7C99B52 # 0.393693841796 494 + .quad 0x03FD93247A7C99B52 + .quad 0x03FD93E24CE3195E8 # 0.394417954789 495 + .quad 0x03FD93E24CE3195E8 + .quad 0x03FD9470C1CB1962E # 0.394961383840 496 + .quad 0x03FD9470C1CB1962E + .quad 0x03FD952ED1D9C0435 # 0.395686415592 497 + .quad 0x03FD952ED1D9C0435 + .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498 + .quad 0x03FD95ED0535EA5D9 + .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499 + .quad 0x03FD967BC2EDCCE17 + .quad 0x03FD973A3431356AE # 0.397682967666 500 + .quad 0x03FD973A3431356AE + .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501 + .quad 0x03FD97F8C8E64A1C7 + .quad 0x03FD9887CFB8A3932 # 0.398955579419 502 + .quad 0x03FD9887CFB8A3932 + .quad 0x03FD9946A2946EF3C # 0.399683513937 503 + .quad 0x03FD9946A2946EF3C + .quad 0x03FD99D5D8130607C # 0.400229812776 504 + .quad 0x03FD99D5D8130607C + .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505 + .quad 0x03FD9A94E93E1EC37 + .quad 0x03FD9B244D87735E8 # 0.401505671875 506 + .quad 0x03FD9B244D87735E8 + .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507 + .quad 0x03FD9BE39D2A97F0B + .quad 0x03FD9CA3109266E23 # 0.402965792595 508 + .quad 0x03FD9CA3109266E23 + .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509 + .quad 0x03FD9D32BEA15ED3A + .quad 0x03FD9DF270C1914A8 # 0.404245149435 510 + .quad 0x03FD9DF270C1914A8 + .quad 0x03FD9E824DEA3E135 # 0.404793946669 511 + .quad 0x03FD9E824DEA3E135 + .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512 + .quad 0x03FD9F423EEBF9DA1 + .quad 0x03FD9FD24B4D47012 # 0.406075646011 513 + .quad 0x03FD9FD24B4D47012 + .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514 + .quad 0x03FDA0927B59DA6E2 + .quad 0x03FDA152CF7F3B46D # 0.407542459622 515 + .quad 0x03FDA152CF7F3B46D + .quad 0x03FDA1E32653B420E # 0.408093069896 516 + .quad 0x03FDA1E32653B420E + .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517 + .quad 0x03FDA2A3B9C527DB1 + .quad 0x03FDA33440224FA79 # 0.409379007429 518 + .quad 0x03FDA33440224FA79 + .quad 0x03FDA3F513098DD09 # 0.410114572008 519 + .quad 0x03FDA3F513098DD09 + .quad 0x03FDA485C90EBDB0C # 0.410666600728 520 + .quad 0x03FDA485C90EBDB0C + .quad 0x03FDA546DB95A721A # 0.411403113374 521 + .quad 0x03FDA546DB95A721A + .quad 0x03FDA5D7C16257437 # 0.411955854060 522 + .quad 0x03FDA5D7C16257437 + .quad 0x03FDA69913B2F6572 # 0.412693317221 523 + .quad 
0x03FDA69913B2F6572 + .quad 0x03FDA72A2966BE1EA # 0.413246771713 524 + .quad 0x03FDA72A2966BE1EA + .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525 + .quad 0x03FDA7EBBBAB46E8B + .quad 0x03FDA87D0165DD199 # 0.414539357989 526 + .quad 0x03FDA87D0165DD199 + .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527 + .quad 0x03FDA93ED3C8AD9E3 + .quad 0x03FDA9D049A9E884A # 0.415833617206 528 + .quad 0x03FDA9D049A9E884A + .quad 0x03FDAA925C5588EFA # 0.416573946686 529 + .quad 0x03FDAA925C5588EFA + .quad 0x03FDAB24027D5E8AF # 0.417129553701 530 + .quad 0x03FDAB24027D5E8AF + .quad 0x03FDABE6559C8167C # 0.417870843580 531 + .quad 0x03FDABE6559C8167C + .quad 0x03FDAC782C2B07944 # 0.418427171828 532 + .quad 0x03FDAC782C2B07944 + .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533 + .quad 0x03FDAD3ABFE88A06E + .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534 + .quad 0x03FDADCCC6FDF6A80 + .quad 0x03FDAE5EE2E961227 # 0.420283837790 535 + .quad 0x03FDAE5EE2E961227 + .quad 0x03FDAF21D34189D0A # 0.421027470470 536 + .quad 0x03FDAF21D34189D0A + .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537 + .quad 0x03FDAFB41FE2167B4 + .quad 0x03FDB07751416A7F3 # 0.422330159776 538 + .quad 0x03FDB07751416A7F3 + .quad 0x03FDB109CEB79DB8A # 0.422888975102 539 + .quad 0x03FDB109CEB79DB8A + .quad 0x03FDB1CD41498DF12 # 0.423634548296 540 + .quad 0x03FDB1CD41498DF12 + .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541 + .quad 0x03FDB25FEFB60CB2E + .quad 0x03FDB323A3A63594A # 0.424940640468 542 + .quad 0x03FDB323A3A63594A + .quad 0x03FDB3B68329C59E9 # 0.425500916886 543 + .quad 0x03FDB3B68329C59E9 + .quad 0x03FDB44977C148F1A # 0.426061507389 544 + .quad 0x03FDB44977C148F1A + .quad 0x03FDB50D895F7773A # 0.426809450580 545 + .quad 0x03FDB50D895F7773A + .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546 + .quad 0x03FDB5A0AF3D169CD + .quad 0x03FDB66502A41E541 # 0.428119698779 547 + .quad 0x03FDB66502A41E541 + .quad 0x03FDB6F859E8EF639 # 0.428681759684 548 + .quad 0x03FDB6F859E8EF639 + .quad 0x03FDB78BC664238C0 # 0.429244136679 549 + .quad 0x03FDB78BC664238C0 + .quad 0x03FDB85078123E586 # 0.429994464983 550 + .quad 0x03FDB85078123E586 + .quad 0x03FDB8E41624226C5 # 0.430557580905 551 + .quad 0x03FDB8E41624226C5 + .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552 + .quad 0x03FDB9A90A06BCB3D + .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553 + .quad 0x03FDBA3CD9D0B81BD + .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554 + .quad 0x03FDBAD0BEF3DB164 + .quad 0x03FDBB9611B80E2FC # 0.433189656123 555 + .quad 0x03FDBB9611B80E2FC + .quad 0x03FDBC2A28C33B75D # 0.433754574696 556 + .quad 0x03FDBC2A28C33B75D + .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557 + .quad 0x03FDBCBE553C2BDDF + .quad 0x03FDBD84073D8EC2B # 0.435073960430 558 + .quad 0x03FDBD84073D8EC2B + .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559 + .quad 0x03FDBE1865CEC1EC9 + .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560 + .quad 0x03FDBEACD9E271AD1 + .quad 0x03FDBF72EB7D20355 # 0.436961822044 561 + .quad 0x03FDBF72EB7D20355 + .quad 0x03FDC00791D99132B # 0.437528876213 562 + .quad 0x03FDC00791D99132B + .quad 0x03FDC09C4DCD565AB # 0.438096252115 563 + .quad 0x03FDC09C4DCD565AB + .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564 + .quad 0x03FDC162BF5DF23E4 + .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565 + .quad 0x03FDC1F7ADCB3DAB0 + .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566 + .quad 0x03FDC28CB1E4D32FD + .quad 0x03FDC35383C8850B0 # 0.440748271097 567 + .quad 0x03FDC35383C8850B0 + .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568 + .quad 0x03FDC3E8BA8CACF27 + .quad 0x03FDC47E071233744 # 0.441887007223 569 + .quad 
0x03FDC47E071233744 + .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570 + .quad 0x03FDC54539A6ABCD2 + .quad 0x03FDC5DAB908186FF # 0.443217173690 571 + .quad 0x03FDC5DAB908186FF + .quad 0x03FDC6704E4016FF7 # 0.443787787115 572 + .quad 0x03FDC6704E4016FF7 + .quad 0x03FDC737E1E38F4FB # 0.444549111857 573 + .quad 0x03FDC737E1E38F4FB + .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574 + .quad 0x03FDC7CDAA290FEAD + .quad 0x03FDC863885A74D16 # 0.445692186852 575 + .quad 0x03FDC863885A74D16 + .quad 0x03FDC8F97C7E299DB # 0.446264214707 576 + .quad 0x03FDC8F97C7E299DB + .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577 + .quad 0x03FDC9C18EDC7C26B + .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578 + .quad 0x03FDCA57B64E9DB05 + .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579 + .quad 0x03FDCAEDF3C88A364 + .quad 0x03FDCB844750B9995 # 0.448746790220 580 + .quad 0x03FDCB844750B9995 + .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581 + .quad 0x03FDCC4CD90B3ECE5 + .quad 0x03FDCCE3602341C10 # 0.450086118843 582 + .quad 0x03FDCCE3602341C10 + .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583 + .quad 0x03FDCD79FD5F2BC77 + .quad 0x03FDCE10B0C581284 # 0.451235544257 584 + .quad 0x03FDCE10B0C581284 + .quad 0x03FDCED9C27EC6607 # 0.452002562511 585 + .quad 0x03FDCED9C27EC6607 + .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586 + .quad 0x03FDCF70A9B6D3810 + .quad 0x03FDD007A72F19BBC # 0.453154194116 587 + .quad 0x03FDD007A72F19BBC + .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588 + .quad 0x03FDD09EBAEE29DD8 + .quad 0x03FDD1684D49F46AE # 0.454499442710 589 + .quad 0x03FDD1684D49F46AE + .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590 + .quad 0x03FDD1FF951D1F1B3 + .quad 0x03FDD296F34D0B65C # 0.455653955057 591 + .quad 0x03FDD296F34D0B65C + .quad 0x03FDD32E67E056BD5 # 0.456231711452 592 + .quad 0x03FDD32E67E056BD5 + .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593 + .quad 0x03FDD3C5F2DDA1840 + .quad 0x03FDD490246DEFA6A # 0.457581109247 594 + .quad 0x03FDD490246DEFA6A + .quad 0x03FDD527E3D1B95FC # 0.458159980465 595 + .quad 0x03FDD527E3D1B95FC + .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596 + .quad 0x03FDD5BFB9B5AE71F + .quad 0x03FDD657A6207C0DB # 0.459318729146 597 + .quad 0x03FDD657A6207C0DB + .quad 0x03FDD6EFA918D25CE # 0.459898607388 598 + .quad 0x03FDD6EFA918D25CE + .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599 + .quad 0x03FDD7BA7AD9E7DA1 + .quad 0x03FDD852B28BE5A0F # 0.461252965726 600 + .quad 0x03FDD852B28BE5A0F + .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601 + .quad 0x03FDD8EB00E1CCE14 + .quad 0x03FDD98365E25ABB9 # 0.462415306035 602 + .quad 0x03FDD98365E25ABB9 + .quad 0x03FDDA1BE1944F538 # 0.462996983220 603 + .quad 0x03FDDA1BE1944F538 + .quad 0x03FDDAE75484C9615 # 0.463773079495 604 + .quad 0x03FDDAE75484C9615 + .quad 0x03FDDB8005445488B # 0.464355547233 605 + .quad 0x03FDDB8005445488B + .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606 + .quad 0x03FDDC18CCCBDCB83 + .quad 0x03FDDCB1AB222F33D # 0.465521501504 607 + .quad 0x03FDDCB1AB222F33D + .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608 + .quad 0x03FDDD4AA04E1C4B7 + .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609 + .quad 0x03FDDDE3AC56775D2 + .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610 + .quad 0x03FDDE7CCF4216D6E + .quad 0x03FDDF492177D7BBC # 0.468052409114 611 + .quad 0x03FDDF492177D7BBC + .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612 + .quad 0x03FDDFE279E5BF4EE + .quad 0x03FDE07BE94DCC439 # 0.469222684263 613 + .quad 0x03FDE07BE94DCC439 + .quad 0x03FDE1156FB6E2626 # 0.469808335817 614 + .quad 0x03FDE1156FB6E2626 + .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615 + .quad 
0x03FDE1AF0D27E88D7 + .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616 + .quad 0x03FDE248C1A7C8C26 + .quad 0x03FDE2E28D3D701CC # 0.471567351222 617 + .quad 0x03FDE2E28D3D701CC + .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618 + .quad 0x03FDE37C6FEFCED73 + .quad 0x03FDE449C232C39D8 # 0.472937616681 619 + .quad 0x03FDE449C232C39D8 + .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620 + .quad 0x03FDE4E3DAEDDB5F6 + .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621 + .quad 0x03FDE57E0ADCE1EA5 + .quad 0x03FDE6185206D516F # 0.474702150027 622 + .quad 0x03FDE6185206D516F + .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623 + .quad 0x03FDE6B2B072B5E6F + .quad 0x03FDE74D26278887A # 0.475880237735 624 + .quad 0x03FDE74D26278887A + .quad 0x03FDE7E7B32C5453F # 0.476469802457 625 + .quad 0x03FDE7E7B32C5453F + .quad 0x03FDE882578823D52 # 0.477059714970 626 + .quad 0x03FDE882578823D52 + .quad 0x03FDE91D134204C67 # 0.477649975686 627 + .quad 0x03FDE91D134204C67 + .quad 0x03FDE9B7E6610815A # 0.478240585015 628 + .quad 0x03FDE9B7E6610815A + .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629 + .quad 0x03FDEA52D0EC41E5E + .quad 0x03FDEB218376ECFC0 # 0.479620031484 630 + .quad 0x03FDEB218376ECFC0 + .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631 + .quad 0x03FDEBBCA4C4E9E87 + .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632 + .quad 0x03FDEC57DD96CD0CB + .quad 0x03FDECF32DF3B887D # 0.481396406174 633 + .quad 0x03FDECF32DF3B887D + .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634 + .quad 0x03FDED8E95E2D1B88 + .quad 0x03FDEE2A156B413E5 # 0.482582411453 635 + .quad 0x03FDEE2A156B413E5 + .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636 + .quad 0x03FDEEC5AC9432FCB + .quad 0x03FDEF615B64D61C7 # 0.483769825010 637 + .quad 0x03FDEF615B64D61C7 + .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638 + .quad 0x03FDEFFD21E45D0D1 + .quad 0x03FDF0990019FD887 # 0.484958650194 639 + .quad 0x03FDF0990019FD887 + .quad 0x03FDF134F60CF092D # 0.485553593197 640 + .quad 0x03FDF134F60CF092D + .quad 0x03FDF1D103C4727E4 # 0.486148890367 641 + .quad 0x03FDF1D103C4727E4 + .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642 + .quad 0x03FDF26D2947C2EC5 + .quad 0x03FDF309669E24CF9 # 0.487340548899 643 + .quad 0x03FDF309669E24CF9 + .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644 + .quad 0x03FDF3A5BBCEDE6E1 + .quad 0x03FDF44228E13963A # 0.488533629176 645 + .quad 0x03FDF44228E13963A + .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646 + .quad 0x03FDF4DEADDC82A35 + .quad 0x03FDF57B4AC80A79A # 0.489728134594 647 + .quad 0x03FDF57B4AC80A79A + .quad 0x03FDF617FFAB248ED # 0.490325922795 648 + .quad 0x03FDF617FFAB248ED + .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649 + .quad 0x03FDF6B4CC8D27E87 + .quad 0x03FDF751B1756EEC8 # 0.491522572320 650 + .quad 0x03FDF751B1756EEC8 + .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651 + .quad 0x03FDF7EEAE6B5761C + .quad 0x03FDF88BC3764273B # 0.492720655530 652 + .quad 0x03FDF88BC3764273B + .quad 0x03FDF928F09D94B32 # 0.493320235842 653 + .quad 0x03FDF928F09D94B32 + .quad 0x03FDF9C635E8B6192 # 0.493920175866 654 + .quad 0x03FDF9C635E8B6192 + .quad 0x03FDFA63935F1208C # 0.494520476034 655 + .quad 0x03FDFA63935F1208C + .quad 0x03FDFB0109081751A # 0.495121136779 656 + .quad 0x03FDFB0109081751A + .quad 0x03FDFB9E96EB38311 # 0.495722158534 657 + .quad 0x03FDFB9E96EB38311 + .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658 + .quad 0x03FDFC3C3D0FEA555 + .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659 + .quad 0x03FDFCD9FB7DA6DEF + .quad 0x03FDFD77D23BEA634 # 0.497527394206 660 + .quad 0x03FDFD77D23BEA634 + .quad 0x03FDFE15C15234EE2 # 0.498129864352 661 + .quad 
0x03FDFE15C15234EE2 + .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662 + .quad 0x03FDFEB3C8C80A04E + .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663 + .quad 0x03FDFF51E8A4F0A74 + .quad 0x03FDFFF020F07352E # 0.499939455677 664 + .quad 0x03FDFFF020F07352E + .quad 0x03FE004738D910023 # 0.500543381211 665 + .quad 0x03FE004738D910023 + .quad 0x03FE00966D78C41CF # 0.501147671692 666 + .quad 0x03FE00966D78C41CF + .quad 0x03FE00E5AE5B207AB # 0.501752327560 667 + .quad 0x03FE00E5AE5B207AB + .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668 + .quad 0x03FE011A8B18F0ED6 + .quad 0x03FE0169E072D7311 # 0.502760900515 669 + .quad 0x03FE0169E072D7311 + .quad 0x03FE01B942198A5A1 # 0.503366532915 670 + .quad 0x03FE01B942198A5A1 + .quad 0x03FE0208B010DB642 # 0.503972532327 671 + .quad 0x03FE0208B010DB642 + .quad 0x03FE02582A5C9D122 # 0.504578899198 672 + .quad 0x03FE02582A5C9D122 + .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673 + .quad 0x03FE02A7B100A3EF0 + .quad 0x03FE02F74400C64EA # 0.505792737097 674 + .quad 0x03FE02F74400C64EA + .quad 0x03FE0346E360DC4F9 # 0.506400209020 675 + .quad 0x03FE0346E360DC4F9 + .quad 0x03FE03968F24BFDB6 # 0.507008050190 676 + .quad 0x03FE03968F24BFDB6 + .quad 0x03FE03E647504CA89 # 0.507616261055 677 + .quad 0x03FE03E647504CA89 + .quad 0x03FE04360BE7603AE # 0.508224842066 678 + .quad 0x03FE04360BE7603AE + .quad 0x03FE046B4089BE0FD # 0.508630768599 679 + .quad 0x03FE046B4089BE0FD + .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680 + .quad 0x03FE04BB19DCA36B3 + .quad 0x03FE050AFFA5671A5 # 0.509849537793 681 + .quad 0x03FE050AFFA5671A5 + .quad 0x03FE055AF1E7ED47B # 0.510459479867 682 + .quad 0x03FE055AF1E7ED47B + .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683 + .quad 0x03FE05AAF0A81BF04 + .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684 + .quad 0x03FE05FAFBE9DAE58 + .quad 0x03FE064B13B113CDD # 0.512291541448 685 + .quad 0x03FE064B13B113CDD + .quad 0x03FE069B3801B2263 # 0.512902975280 686 + .quad 0x03FE069B3801B2263 + .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687 + .quad 0x03FE06D0AC85B63A2 + .quad 0x03FE0720E5C40DF1D # 0.513922863181 688 + .quad 0x03FE0720E5C40DF1D + .quad 0x03FE07712B9648153 # 0.514535295577 689 + .quad 0x03FE07712B9648153 + .quad 0x03FE07C17E0056E7C # 0.515148103277 690 + .quad 0x03FE07C17E0056E7C + .quad 0x03FE0811DD062E889 # 0.515761286740 691 + .quad 0x03FE0811DD062E889 + .quad 0x03FE086248ABC4F3B # 0.516374846428 692 + .quad 0x03FE086248ABC4F3B + .quad 0x03FE08B2C0F512033 # 0.516988782802 693 + .quad 0x03FE08B2C0F512033 + .quad 0x03FE08E86D82DA3EE # 0.517398283218 694 + .quad 0x03FE08E86D82DA3EE + .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695 + .quad 0x03FE0938FAE5D8E9B + .quad 0x03FE098994F72C539 # 0.518627791569 696 + .quad 0x03FE098994F72C539 + .quad 0x03FE09DA3BBAD339C # 0.519243113094 697 + .quad 0x03FE09DA3BBAD339C + .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698 + .quad 0x03FE0A2AEF34CE3D1 + .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699 + .quad 0x03FE0A7BAF691FE34 + .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700 + .quad 0x03FE0AB18BF5823C3 + .quad 0x03FE0B02616952989 # 0.521502536876 701 + .quad 0x03FE0B02616952989 + .quad 0x03FE0B5343A234476 # 0.522119630385 702 + .quad 0x03FE0B5343A234476 + .quad 0x03FE0BA432A430CA2 # 0.522737104934 703 + .quad 0x03FE0BA432A430CA2 + .quad 0x03FE0BF52E73538CE # 0.523354960993 704 + .quad 0x03FE0BF52E73538CE + .quad 0x03FE0C463713A9E6F # 0.523973199034 705 + .quad 0x03FE0C463713A9E6F + .quad 0x03FE0C7C43F4C861E # 0.524385570174 706 + .quad 0x03FE0C7C43F4C861E + .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707 + .quad 
0x03FE0CCD61FAD07D2 + .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708 + .quad 0x03FE0D1E8CDCE3DB6 + .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709 + .quad 0x03FE0D6FC49F16E93 + .quad 0x03FE0DC109458004A # 0.526863374456 710 + .quad 0x03FE0DC109458004A + .quad 0x03FE0DF73E353F0ED # 0.527276939392 711 + .quad 0x03FE0DF73E353F0ED + .quad 0x03FE0E4898611CCE1 # 0.527897607665 712 + .quad 0x03FE0E4898611CCE1 + .quad 0x03FE0E99FF7C20738 # 0.528518661406 713 + .quad 0x03FE0E99FF7C20738 + .quad 0x03FE0EEB738A67874 # 0.529140101094 714 + .quad 0x03FE0EEB738A67874 + .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715 + .quad 0x03FE0F21C81D1ADC3 + .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716 + .quad 0x03FE0F7351C9FCD7F + .quad 0x03FE0FC4E875254C1 # 0.530799164104 717 + .quad 0x03FE0FC4E875254C1 + .quad 0x03FE10168C22B8FB9 # 0.531422023047 718 + .quad 0x03FE10168C22B8FB9 + .quad 0x03FE10683CD6DEA54 # 0.532045270185 719 + .quad 0x03FE10683CD6DEA54 + .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720 + .quad 0x03FE109EB9E2E4C97 + .quad 0x03FE10F08055E7785 # 0.533084879385 721 + .quad 0x03FE10F08055E7785 + .quad 0x03FE114253DA97DA0 # 0.533709164079 722 + .quad 0x03FE114253DA97DA0 + .quad 0x03FE1194347523FDC # 0.534333838748 723 + .quad 0x03FE1194347523FDC + .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724 + .quad 0x03FE11CAD1789B0F8 + .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725 + .quad 0x03FE121CC7EB8F7E6 + .quad 0x03FE126ECB7F8F007 # 0.536001548120 726 + .quad 0x03FE126ECB7F8F007 + .quad 0x03FE12A57FDA37091 # 0.536418910396 727 + .quad 0x03FE12A57FDA37091 + .quad 0x03FE12F799594EFBC # 0.537045280601 728 + .quad 0x03FE12F799594EFBC + .quad 0x03FE1349C004AFB00 # 0.537672043392 729 + .quad 0x03FE1349C004AFB00 + .quad 0x03FE139BF3E094003 # 0.538299199261 730 + .quad 0x03FE139BF3E094003 + .quad 0x03FE13D2C873C5E13 # 0.538717521794 731 + .quad 0x03FE13D2C873C5E13 + .quad 0x03FE142512549C16C # 0.539345333889 732 + .quad 0x03FE142512549C16C + .quad 0x03FE14776971477F1 # 0.539973540381 733 + .quad 0x03FE14776971477F1 + .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734 + .quad 0x03FE14C9CDCE0A74D + .quad 0x03FE1500C2BFD1561 # 0.541021428981 735 + .quad 0x03FE1500C2BFD1561 + .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736 + .quad 0x03FE15533D3B8D7B3 + .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737 + .quad 0x03FE15A5C502C6DC5 + .quad 0x03FE15DCD1973457B # 0.542700338085 738 + .quad 0x03FE15DCD1973457B + .quad 0x03FE162F6F9071F76 # 0.543330656416 739 + .quad 0x03FE162F6F9071F76 + .quad 0x03FE16821AE0A13C6 # 0.543961372300 740 + .quad 0x03FE16821AE0A13C6 + .quad 0x03FE16B93F2C12808 # 0.544382070665 741 + .quad 0x03FE16B93F2C12808 + .quad 0x03FE170C00C169B51 # 0.545013450251 742 + .quad 0x03FE170C00C169B51 + .quad 0x03FE175ECFB935CC6 # 0.545645228728 743 + .quad 0x03FE175ECFB935CC6 + .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744 + .quad 0x03FE17B1AC17CBD5B + .quad 0x03FE17E8F12052E8A # 0.546699080654 745 + .quad 0x03FE17E8F12052E8A + .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746 + .quad 0x03FE183BE3DE8A7AF + .quad 0x03FE188EE40F23CA7 # 0.547965170715 747 + .quad 0x03FE188EE40F23CA7 + .quad 0x03FE18C640FF75F06 # 0.548387557205 748 + .quad 0x03FE18C640FF75F06 + .quad 0x03FE191957A30FA51 # 0.549021471648 749 + .quad 0x03FE191957A30FA51 + .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750 + .quad 0x03FE196C7BC4B1F3A + .quad 0x03FE19A3F0B1860BD # 0.550078889532 751 + .quad 0x03FE19A3F0B1860BD + .quad 0x03FE19F72B59A0CEC # 0.550713877383 752 + .quad 0x03FE19F72B59A0CEC + .quad 0x03FE1A4A738B7A33C # 0.551349268700 753 + .quad 
0x03FE1A4A738B7A33C + .quad 0x03FE1A820089A2156 # 0.551773087312 754 + .quad 0x03FE1A820089A2156 + .quad 0x03FE1AD55F55855C8 # 0.552409152212 755 + .quad 0x03FE1AD55F55855C8 + .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756 + .quad 0x03FE1B28CBB6EC93E + .quad 0x03FE1B6070DB553D8 # 0.553470160269 757 + .quad 0x03FE1B6070DB553D8 + .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758 + .quad 0x03FE1BB3F3EA714F6 + .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759 + .quad 0x03FE1BEBA8316EF2C + .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760 + .quad 0x03FE1C3F41FA97C6B + .quad 0x03FE1C92E96C86020 # 0.555808348176 761 + .quad 0x03FE1C92E96C86020 + .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762 + .quad 0x03FE1CCAB5FBFFEE1 + .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763 + .quad 0x03FE1D1E743BCFC47 + .quad 0x03FE1D72403052E75 # 0.557512288951 764 + .quad 0x03FE1D72403052E75 + .quad 0x03FE1DAA251D7E433 # 0.557938728190 765 + .quad 0x03FE1DAA251D7E433 + .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766 + .quad 0x03FE1DFE07F3D1DAB + .quad 0x03FE1E35FC265D75E # 0.559005622562 767 + .quad 0x03FE1E35FC265D75E + .quad 0x03FE1E89F5EB04126 # 0.559646305979 768 + .quad 0x03FE1E89F5EB04126 + .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769 + .quad 0x03FE1EDDFD77E1FEF + .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770 + .quad 0x03FE1F160A2AD0DA3 + .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771 + .quad 0x03FE1F6A28BA1B476 + .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772 + .quad 0x03FE1FBE551DB43C1 + .quad 0x03FE1FF67A6684F47 # 0.562427353873 773 + .quad 0x03FE1FF67A6684F47 + .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774 + .quad 0x03FE204ABDE0BE5DF + .quad 0x03FE2082F29233211 # 0.563499050471 775 + .quad 0x03FE2082F29233211 + .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776 + .quad 0x03FE20D74D2FBAFE4 + .quad 0x03FE210F91524B469 # 0.564571896835 777 + .quad 0x03FE210F91524B469 + .quad 0x03FE2164031FDA0B0 # 0.565216157568 778 + .quad 0x03FE2164031FDA0B0 + .quad 0x03FE21B882DD26040 # 0.565860833641 779 + .quad 0x03FE21B882DD26040 + .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780 + .quad 0x03FE21F0DFC65CEEC + .quad 0x03FE224576C81FFE0 # 0.566936218194 781 + .quad 0x03FE224576C81FFE0 + .quad 0x03FE227DE33896A44 # 0.567366696031 782 + .quad 0x03FE227DE33896A44 + .quad 0x03FE22D2918BA4A31 # 0.568012760445 783 + .quad 0x03FE22D2918BA4A31 + .quad 0x03FE23274DE272A83 # 0.568659242528 784 + .quad 0x03FE23274DE272A83 + .quad 0x03FE235FD33D232FC # 0.569090462888 785 + .quad 0x03FE235FD33D232FC + .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786 + .quad 0x03FE23B4A6F9D8688 + .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787 + .quad 0x03FE23ED3BF21CA33 + .quad 0x03FE24422721A89D7 # 0.570817206248 788 + .quad 0x03FE24422721A89D7 + .quad 0x03FE247ACBC023D2B # 0.571249358372 789 + .quad 0x03FE247ACBC023D2B + .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790 + .quad 0x03FE24CFCE6F80D9B + .quad 0x03FE250882BCDD7D8 # 0.572330556445 791 + .quad 0x03FE250882BCDD7D8 + .quad 0x03FE255D9CF910A56 # 0.572979836849 792 + .quad 0x03FE255D9CF910A56 + .quad 0x03FE25B2C55CD5762 # 0.573629539091 793 + .quad 0x03FE25B2C55CD5762 + .quad 0x03FE25EB92D41992D # 0.574062908546 794 + .quad 0x03FE25EB92D41992D + .quad 0x03FE2640D2D99FFEA # 0.574713315073 795 + .quad 0x03FE2640D2D99FFEA + .quad 0x03FE2679B0166F51C # 0.575147154559 796 + .quad 0x03FE2679B0166F51C + .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797 + .quad 0x03FE26CF07CAD8B00 + .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798 + .quad 0x03FE2707F4D5F7C40 + .quad 0x03FE275D644670606 # 0.576884397124 799 + .quad 
0x03FE275D644670606 + .quad 0x03FE27966128AB11B # 0.577319179739 800 + .quad 0x03FE27966128AB11B + .quad 0x03FE27EBE8626A387 # 0.577971708311 801 + .quad 0x03FE27EBE8626A387 + .quad 0x03FE2824F52493BD2 # 0.578406964030 802 + .quad 0x03FE2824F52493BD2 + .quad 0x03FE287A9434DBC7B # 0.579060203030 803 + .quad 0x03FE287A9434DBC7B + .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804 + .quad 0x03FE28B3B0DFCEB80 + .quad 0x03FE290967D3ED18D # 0.580149883861 805 + .quad 0x03FE290967D3ED18D + .quad 0x03FE294294708B773 # 0.580586088885 806 + .quad 0x03FE294294708B773 + .quad 0x03FE29986355D8C69 # 0.581240753393 807 + .quad 0x03FE29986355D8C69 + .quad 0x03FE29D19FED0C082 # 0.581677434622 808 + .quad 0x03FE29D19FED0C082 + .quad 0x03FE2A2786D0EC107 # 0.582332814220 809 + .quad 0x03FE2A2786D0EC107 + .quad 0x03FE2A60D36BA5253 # 0.582769972697 810 + .quad 0x03FE2A60D36BA5253 + .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811 + .quad 0x03FE2AB6D25B86EF7 + .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812 + .quad 0x03FE2AF02F02BE4AB + .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813 + .quad 0x03FE2B46460C1C2B3 + .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814 + .quad 0x03FE2B7FB2C8D1CC1 + .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815 + .quad 0x03FE2BD5E1F9316F2 + .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816 + .quad 0x03FE2C0F5ED46CE8D + .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817 + .quad 0x03FE2C65A6395F5F5 + .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818 + .quad 0x03FE2C9F333C2FE1E + .quad 0x03FE2CF592E351AE5 # 0.587811079263 819 + .quad 0x03FE2CF592E351AE5 + .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820 + .quad 0x03FE2D2F3016CE0EF + .quad 0x03FE2D85A80DC7324 # 0.588910342867 821 + .quad 0x03FE2D85A80DC7324 + .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822 + .quad 0x03FE2DBF557B0DF43 + .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823 + .quad 0x03FE2E15E5CF91FA7 + .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824 + .quad 0x03FE2E4FA37FC9577 + .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825 + .quad 0x03FE2E8967B3BF4E1 + .quad 0x03FE2EE01A3BED567 # 0.591553516212 826 + .quad 0x03FE2EE01A3BED567 + .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827 + .quad 0x03FE2F19EEBFB00BA + .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828 + .quad 0x03FE2F70B9C67A7C2 + .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829 + .quad 0x03FE2FAA9EA342D04 + .quad 0x03FE3001823684D73 # 0.593761510043 830 + .quad 0x03FE3001823684D73 + .quad 0x03FE303B7775937EF # 0.594203694441 831 + .quad 0x03FE303B7775937EF + .quad 0x03FE309273A3340FC # 0.594867337868 832 + .quad 0x03FE309273A3340FC + .quad 0x03FE30CC794DD19D0 # 0.595310011625 833 + .quad 0x03FE30CC794DD19D0 + .quad 0x03FE3106858C76BB7 # 0.595752881428 834 + .quad 0x03FE3106858C76BB7 + .quad 0x03FE315DA4434068B # 0.596417554101 835 + .quad 0x03FE315DA4434068B + .quad 0x03FE3197C0FA80E6A # 0.596860914783 836 + .quad 0x03FE3197C0FA80E6A + .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837 + .quad 0x03FE31EEF86D36EF1 + .quad 0x03FE322925A66E62D # 0.597970177237 838 + .quad 0x03FE322925A66E62D + .quad 0x03FE328075E32022F # 0.598636325813 839 + .quad 0x03FE328075E32022F + .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840 + .quad 0x03FE32BAB3A7B21E9 + .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841 + .quad 0x03FE32F4F80D0B1BD + .quad 0x03FE334C6B15D30DD # 0.600192400374 842 + .quad 0x03FE334C6B15D30DD + .quad 0x03FE3386C013B90D6 # 0.600637438209 843 + .quad 0x03FE3386C013B90D6 + .quad 0x03FE33DE4C086C40A # 0.601305366543 844 + .quad 0x03FE33DE4C086C40A + .quad 0x03FE3418B1A85622C # 0.601750900077 845 + .quad 
0x03FE3418B1A85622C + .quad 0x03FE34531DF21CFE3 # 0.602196632199 846 + .quad 0x03FE34531DF21CFE3 + .quad 0x03FE34AACCE299BA5 # 0.602865603124 847 + .quad 0x03FE34AACCE299BA5 + .quad 0x03FE34E549DBB21EF # 0.603311832493 848 + .quad 0x03FE34E549DBB21EF + .quad 0x03FE353D11DA4F855 # 0.603981550121 849 + .quad 0x03FE353D11DA4F855 + .quad 0x03FE35779F8C43D6D # 0.604428277847 850 + .quad 0x03FE35779F8C43D6D + .quad 0x03FE35B233F13DD4A # 0.604875205229 851 + .quad 0x03FE35B233F13DD4A + .quad 0x03FE360A1F1BBA738 # 0.605545971045 852 + .quad 0x03FE360A1F1BBA738 + .quad 0x03FE3644C446F97BC # 0.605993398346 853 + .quad 0x03FE3644C446F97BC + .quad 0x03FE367F702A9EA94 # 0.606441025927 854 + .quad 0x03FE367F702A9EA94 + .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855 + .quad 0x03FE36D77E9D34FD7 + .quad 0x03FE37123B54987B7 # 0.607560972287 856 + .quad 0x03FE37123B54987B7 + .quad 0x03FE376A630C0A1D6 # 0.608233542652 857 + .quad 0x03FE376A630C0A1D6 + .quad 0x03FE37A530A0D5A31 # 0.608682174333 858 + .quad 0x03FE37A530A0D5A31 + .quad 0x03FE37E004F74E13B # 0.609131007374 859 + .quad 0x03FE37E004F74E13B + .quad 0x03FE383850278CFD9 # 0.609804634884 860 + .quad 0x03FE383850278CFD9 + .quad 0x03FE3873356902AB7 # 0.610253972119 861 + .quad 0x03FE3873356902AB7 + .quad 0x03FE38AE2171976E8 # 0.610703511349 862 + .quad 0x03FE38AE2171976E8 + .quad 0x03FE390690373AFFF # 0.611378199331 863 + .quad 0x03FE390690373AFFF + .quad 0x03FE39418D3872A53 # 0.611828244343 864 + .quad 0x03FE39418D3872A53 + .quad 0x03FE397C91064221F # 0.612278491987 865 + .quad 0x03FE397C91064221F + .quad 0x03FE39D5237E045A5 # 0.612954243787 866 + .quad 0x03FE39D5237E045A5 + .quad 0x03FE3A1038522CE82 # 0.613404998809 867 + .quad 0x03FE3A1038522CE82 + .quad 0x03FE3A68E45AD354B # 0.614081512534 868 + .quad 0x03FE3A68E45AD354B + .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869 + .quad 0x03FE3AA40A3F2A68B + .quad 0x03FE3ADF36F98A182 # 0.614984243356 870 + .quad 0x03FE3ADF36F98A182 + .quad 0x03FE3B3806E5DF340 # 0.615661826668 871 + .quad 0x03FE3B3806E5DF340 + .quad 0x03FE3B7344BE40311 # 0.616113804077 872 + .quad 0x03FE3B7344BE40311 + .quad 0x03FE3BAE897234A87 # 0.616565985862 873 + .quad 0x03FE3BAE897234A87 + .quad 0x03FE3C077D5F51881 # 0.617244642149 874 + .quad 0x03FE3C077D5F51881 + .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875 + .quad 0x03FE3C42D33F2AE7B + .quad 0x03FE3C7E30002960C # 0.618150234241 876 + .quad 0x03FE3C7E30002960C + .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877 + .quad 0x03FE3CD7480B4A8A3 + .quad 0x03FE3D12B60622748 # 0.619283378838 878 + .quad 0x03FE3D12B60622748 + .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879 + .quad 0x03FE3D4E2AE7B7E2B + .quad 0x03FE3D89A6B1A558D # 0.620190819917 880 + .quad 0x03FE3D89A6B1A558D + .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881 + .quad 0x03FE3DE2ED57B1F9B + .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882 + .quad 0x03FE3E1E7A6D8330E + .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883 + .quad 0x03FE3E5A0E714DA6E + .quad 0x03FE3EB37978B85B6 # 0.622463031756 884 + .quad 0x03FE3EB37978B85B6 + .quad 0x03FE3EEF1ED68236B # 0.622918094335 885 + .quad 0x03FE3EEF1ED68236B + .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886 + .quad 0x03FE3F2ACB27ED6C7 + .quad 0x03FE3F845AAE68C81 # 0.624056657591 887 + .quad 0x03FE3F845AAE68C81 + .quad 0x03FE3FC0186800514 # 0.624512446113 888 + .quad 0x03FE3FC0186800514 + .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889 + .quad 0x03FE3FFBDD1AE8406 + .quad 0x03FE4037A8C8C197A # 0.625424646860 890 + .quad 0x03FE4037A8C8C197A + .quad 0x03FE409167679DD99 # 0.626109343909 891 + .quad 
0x03FE409167679DD99 + .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892 + .quad 0x03FE40CD448FF6DD6 + .quad 0x03FE410928B8F950F # 0.627023003177 893 + .quad 0x03FE410928B8F950F + .quad 0x03FE41630C1B50AFF # 0.627708795866 894 + .quad 0x03FE41630C1B50AFF + .quad 0x03FE419F01CD27AD0 # 0.628166252416 895 + .quad 0x03FE419F01CD27AD0 + .quad 0x03FE41DAFE85672B9 # 0.628623918328 896 + .quad 0x03FE41DAFE85672B9 + .quad 0x03FE42170245B4C6A # 0.629081793794 897 + .quad 0x03FE42170245B4C6A + .quad 0x03FE42711518DF546 # 0.629769000326 898 + .quad 0x03FE42711518DF546 + .quad 0x03FE42AD2A74888A0 # 0.630227400518 899 + .quad 0x03FE42AD2A74888A0 + .quad 0x03FE42E946DE080C0 # 0.630686010936 900 + .quad 0x03FE42E946DE080C0 + .quad 0x03FE43437EB9D9424 # 0.631374321162 901 + .quad 0x03FE43437EB9D9424 + .quad 0x03FE437FACCD31C10 # 0.631833457993 902 + .quad 0x03FE437FACCD31C10 + .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903 + .quad 0x03FE43BBE1F42FE09 + .quad 0x03FE43F81E307DE5E # 0.632752364559 904 + .quad 0x03FE43F81E307DE5E + .quad 0x03FE445285D68EA69 # 0.633442099038 905 + .quad 0x03FE445285D68EA69 + .quad 0x03FE448ED3CF71355 # 0.633902186463 906 + .quad 0x03FE448ED3CF71355 + .quad 0x03FE44CB28E37C3EE # 0.634362485666 907 + .quad 0x03FE44CB28E37C3EE + .quad 0x03FE450785145CAFE # 0.634822996841 908 + .quad 0x03FE450785145CAFE + .quad 0x03FE45621CB769366 # 0.635514161481 909 + .quad 0x03FE45621CB769366 + .quad 0x03FE459E8AB7B799D # 0.635975203444 910 + .quad 0x03FE459E8AB7B799D + .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911 + .quad 0x03FE45DAFFDABD4DB + .quad 0x03FE46177C2229EC0 # 0.636897925539 912 + .quad 0x03FE46177C2229EC0 + .quad 0x03FE467243F53F69E # 0.637590526283 913 + .quad 0x03FE467243F53F69E + .quad 0x03FE46AED21F117FC # 0.638052526753 914 + .quad 0x03FE46AED21F117FC + .quad 0x03FE46EB677335D13 # 0.638514740766 915 + .quad 0x03FE46EB677335D13 + .quad 0x03FE472803F35EAAE # 0.638977168520 916 + .quad 0x03FE472803F35EAAE + .quad 0x03FE4764A7A13EF3B # 0.639439810212 917 + .quad 0x03FE4764A7A13EF3B + .quad 0x03FE47BFAA9F80271 # 0.640134174319 918 + .quad 0x03FE47BFAA9F80271 + .quad 0x03FE47FC60471DAF8 # 0.640597351724 919 + .quad 0x03FE47FC60471DAF8 + .quad 0x03FE48391D226992D # 0.641060743762 920 + .quad 0x03FE48391D226992D + .quad 0x03FE4875E1331971E # 0.641524350631 921 + .quad 0x03FE4875E1331971E + .quad 0x03FE48D114D3FB884 # 0.642220164181 922 + .quad 0x03FE48D114D3FB884 + .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923 + .quad 0x03FE490DEAF1A3FC8 + .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924 + .quad 0x03FE494AC84AB0ED3 + .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925 + .quad 0x03FE4987ACE0DABB0 + .quad 0x03FE49C498B5DA63F # 0.644078037452 926 + .quad 0x03FE49C498B5DA63F + .quad 0x03FE4A20080EF10B2 # 0.644775630783 927 + .quad 0x03FE4A20080EF10B2 + .quad 0x03FE4A5D060894B8C # 0.645240963504 928 + .quad 0x03FE4A5D060894B8C + .quad 0x03FE4A9A0B471A943 # 0.645706512861 929 + .quad 0x03FE4A9A0B471A943 + .quad 0x03FE4AD717CC3E626 # 0.646172279055 930 + .quad 0x03FE4AD717CC3E626 + .quad 0x03FE4B142B99BC871 # 0.646638262288 931 + .quad 0x03FE4B142B99BC871 + .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932 + .quad 0x03FE4B6FD6F970C1F + .quad 0x03FE4BACFD036D080 # 0.647804171246 933 + .quad 0x03FE4BACFD036D080 + .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934 + .quad 0x03FE4BEA2A5BDBE87 + .quad 0x03FE4C275F047C956 # 0.648737878130 935 + .quad 0x03FE4C275F047C956 + .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936 + .quad 0x03FE4C649AFF0EE16 + .quad 0x03FE4CC082B46485A # 0.649906239052 937 + .quad 
0x03FE4CC082B46485A + .quad 0x03FE4CFDD1037E37C # 0.650373965908 938 + .quad 0x03FE4CFDD1037E37C + .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939 + .quad 0x03FE4D3B26AAADDD9 + .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940 + .quad 0x03FE4D7883ABB61F6 + .quad 0x03FE4DB5E8085A477 # 0.651778460521 941 + .quad 0x03FE4DB5E8085A477 + .quad 0x03FE4DF353C25E42B # 0.652247064091 942 + .quad 0x03FE4DF353C25E42B + .quad 0x03FE4E4F832C560DD # 0.652950381434 943 + .quad 0x03FE4E4F832C560DD + .quad 0x03FE4E8D015786F16 # 0.653419534621 944 + .quad 0x03FE4E8D015786F16 + .quad 0x03FE4ECA86E64A683 # 0.653888908016 945 + .quad 0x03FE4ECA86E64A683 + .quad 0x03FE4F0813DA673DD # 0.654358501826 946 + .quad 0x03FE4F0813DA673DD + .quad 0x03FE4F45A835A4E19 # 0.654828316258 947 + .quad 0x03FE4F45A835A4E19 + .quad 0x03FE4F8343F9CB678 # 0.655298351519 948 + .quad 0x03FE4F8343F9CB678 + .quad 0x03FE4FDFBB88A119A # 0.656003818920 949 + .quad 0x03FE4FDFBB88A119A + .quad 0x03FE501D69DADD660 # 0.656474407164 950 + .quad 0x03FE501D69DADD660 + .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951 + .quad 0x03FE505B1F9C43ED7 + .quad 0x03FE5098DCCE9FABA # 0.657416248534 952 + .quad 0x03FE5098DCCE9FABA + .quad 0x03FE50D6A173BC425 # 0.657887502077 953 + .quad 0x03FE50D6A173BC425 + .quad 0x03FE51146D8D65F98 # 0.658358977805 954 + .quad 0x03FE51146D8D65F98 + .quad 0x03FE5152411D69C03 # 0.658830675927 955 + .quad 0x03FE5152411D69C03 + .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956 + .quad 0x03FE51AF0C774A2D0 + .quad 0x03FE51ECF2B713F8A # 0.660010895584 957 + .quad 0x03FE51ECF2B713F8A + .quad 0x03FE522AE0738A3D8 # 0.660483373741 958 + .quad 0x03FE522AE0738A3D8 + .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959 + .quad 0x03FE5268D5AE7CDCB + .quad 0x03FE52A6D269BC600 # 0.661429000289 960 + .quad 0x03FE52A6D269BC600 + .quad 0x03FE52E4D6A719F9B # 0.661902149103 961 + .quad 0x03FE52E4D6A719F9B + .quad 0x03FE5322E26867857 # 0.662375521893 962 + .quad 0x03FE5322E26867857 + .quad 0x03FE53800225BA6E2 # 0.663086001497 963 + .quad 0x03FE53800225BA6E2 + .quad 0x03FE53BE20B8DA502 # 0.663559935155 964 + .quad 0x03FE53BE20B8DA502 + .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965 + .quad 0x03FE53FC46D64DDD1 + .quad 0x03FE543A747FE9ED6 # 0.664508476843 966 + .quad 0x03FE543A747FE9ED6 + .quad 0x03FE5478A9B78404C # 0.664983085300 967 + .quad 0x03FE5478A9B78404C + .quad 0x03FE54B6E67EF251C # 0.665457919117 968 + .quad 0x03FE54B6E67EF251C + .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969 + .quad 0x03FE54F52AD80BAE9 + .quad 0x03FE553376C4A7A16 # 0.666408263689 970 + .quad 0x03FE553376C4A7A16 + .quad 0x03FE5571CA469E5C9 # 0.666883774872 971 + .quad 0x03FE5571CA469E5C9 + .quad 0x03FE55CF55C5A5437 # 0.667597465874 972 + .quad 0x03FE55CF55C5A5437 + .quad 0x03FE560DBC45153C7 # 0.668073543008 973 + .quad 0x03FE560DBC45153C7 + .quad 0x03FE564C2A6059FE7 # 0.668549846899 974 + .quad 0x03FE564C2A6059FE7 + .quad 0x03FE568AA0194EC6E # 0.669026377763 975 + .quad 0x03FE568AA0194EC6E + .quad 0x03FE56C91D71CF810 # 0.669503135817 976 + .quad 0x03FE56C91D71CF810 + .quad 0x03FE5707A26BB8C66 # 0.669980121278 977 + .quad 0x03FE5707A26BB8C66 + .quad 0x03FE57462F08E7DF5 # 0.670457334363 978 + .quad 0x03FE57462F08E7DF5 + .quad 0x03FE5784C34B3AC30 # 0.670934775289 979 + .quad 0x03FE5784C34B3AC30 + .quad 0x03FE57C35F3490183 # 0.671412444273 980 + .quad 0x03FE57C35F3490183 + .quad 0x03FE580202C6C7353 # 0.671890341535 981 + .quad 0x03FE580202C6C7353 + .quad 0x03FE5840AE03C0204 # 0.672368467291 982 + .quad 0x03FE5840AE03C0204 + .quad 0x03FE589EBD437CA31 # 0.673086084831 983 + .quad 
0x03FE589EBD437CA31 + .quad 0x03FE58DD7BB392B30 # 0.673564782782 984 + .quad 0x03FE58DD7BB392B30 + .quad 0x03FE591C41D500163 # 0.674043709994 985 + .quad 0x03FE591C41D500163 + .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986 + .quad 0x03FE595B0FA9A7EF1 + .quad 0x03FE5999E5336E121 # 0.675002253082 987 + .quad 0x03FE5999E5336E121 + .quad 0x03FE59D8C2743705E # 0.675481869398 988 + .quad 0x03FE59D8C2743705E + .quad 0x03FE5A17A76DE803B # 0.675961715857 989 + .quad 0x03FE5A17A76DE803B + .quad 0x03FE5A56942266F7B # 0.676441792678 990 + .quad 0x03FE5A56942266F7B + .quad 0x03FE5A9588939A810 # 0.676922100084 991 + .quad 0x03FE5A9588939A810 + .quad 0x03FE5AD484C369F2D # 0.677402638296 992 + .quad 0x03FE5AD484C369F2D + .quad 0x03FE5B1388B3BD53E # 0.677883407536 993 + .quad 0x03FE5B1388B3BD53E + .quad 0x03FE5B5294667D5F7 # 0.678364408027 994 + .quad 0x03FE5B5294667D5F7 + .quad 0x03FE5B91A7DD93852 # 0.678845639990 995 + .quad 0x03FE5B91A7DD93852 + .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996 + .quad 0x03FE5BD0C31AE9E9D + .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997 + .quad 0x03FE5C2F7A8ED5E5B + .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998 + .quad 0x03FE5C6EA94431EF9 + .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999 + .quad 0x03FE5CADDFC6874F5 + .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000 + .quad 0x03FE5CED1E17C35C6 + .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001 + .quad 0x03FE5D2C6439D4252 + .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002 + .quad 0x03FE5D6BB22EA86F6 + .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003 + .quad 0x03FE5DAB07F82FB84 + .quad 0x03FE5DEA65985A350 # 0.683428931091 1004 + .quad 0x03FE5DEA65985A350 + .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005 + .quad 0x03FE5E29CB1118D32 + .quad 0x03FE5E6938645D390 # 0.684396517040 1006 + .quad 0x03FE5E6938645D390 + .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007 + .quad 0x03FE5EA8AD9419C5B + .quad 0x03FE5EE82AA241920 # 0.685365040118 1008 + .quad 0x03FE5EE82AA241920 + .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009 + .quad 0x03FE5F27AF90C8705 + .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010 + .quad 0x03FE5F673C61A2ED2 + .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011 + .quad 0x03FE5FA6D116C64F7 + .quad 0x03FE5FE66DB228992 # 0.687304904936 1012 + .quad 0x03FE5FE66DB228992 + .quad 0x03FE60261235C0874 # 0.687790459692 1013 + .quad 0x03FE60261235C0874 + .quad 0x03FE6065BEA385926 # 0.688276250325 1014 + .quad 0x03FE6065BEA385926 + .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015 + .quad 0x03FE60A572FD6FEF1 + .quad 0x03FE60E52F45788E4 # 0.689248540144 1016 + .quad 0x03FE60E52F45788E4 + .quad 0x03FE6124F37D991D4 # 0.689735039789 1017 + .quad 0x03FE6124F37D991D4 + .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018 + .quad 0x03FE6164BFA7CC06C + .quad 0x03FE61A493C60C729 # 0.690708749700 1019 + .quad 0x03FE61A493C60C729 + .quad 0x03FE61E46FDA56466 # 0.691195960429 1020 + .quad 0x03FE61E46FDA56466 + .quad 0x03FE622453E6A6263 # 0.691683408647 1021 + .quad 0x03FE622453E6A6263 + .quad 0x03FE62643FECF9743 # 0.692171094587 1022 + .quad 0x03FE62643FECF9743 + .quad 0x03FE62A433EF4E51A # 0.692659018480 1023 + .quad 0x03FE62A433EF4E51A
diff --git a/src/gas/vrda_scaledshifted_logr.S b/src/gas/vrda_scaledshifted_logr.S
new file mode 100644
index 0000000..960460d
--- /dev/null
+++ b/src/gas/vrda_scaledshifted_logr.S
@@ -0,0 +1,2451 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv.  If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# vrda_scaledshifted_logr.s
+#
+# An array implementation of the log libm function.
+# Adapted to provide a scaling and shifting factor.  This routine is
+# used by the ACML RNG distribution functions.
+#
+# Prototype:
+#
+#     void vrda_scaledshifted_logr(int n, double *x, double *y, double b, double a);
+#
+#   Computes the natural log of each x[i], multiplied by b, plus a.
+#   A reduced-precision routine.  Uses the Intel novel reduction technique
+#   with frcpa to compute logs.
+#   Also uses only 3 polynomial terms to achieve 52-18 = 34 significant bits.
+#
+#   This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+#   This routine is not C99 compliant.
+#   This version can compute logs in 26 cycles with n <= 24.
+#
+
+
+# define local variable storage offsets
+.equ	p_x,0			# temporary for error checking operation
+.equ	p_idx,0x010		# index storage
+.equ	p_xexp,0x020		# index storage
+
+.equ	p_x2,0x030		# temporary for error checking operation
+.equ	p_idx2,0x040		# index storage
+.equ	p_xexp2,0x050		# index storage
+
+.equ	save_xa,0x060		#qword
+.equ	save_ya,0x068		#qword
+.equ	save_nv,0x070		#qword
+.equ	p_iter,0x078		# qword storage for number of loop iterations
+
+.equ	p2_temp,0x090		# second temporary for get/put bits operation
+
+.equ	stack_size,0x0e8
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+	.weak vrda_scaledshifted_logr__
+	.set vrda_scaledshifted_logr__,__vrda_scaledshifted_logr__
+	.weak vrda_scaledshifted_logr_
+	.set vrda_scaledshifted_logr_,__vrda_scaledshifted_logr__
+
+# Fortran interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# rcx - double *b
+# r8  - double *a
+
+	.text
+	.align 16
+	.p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array scaled/shifted log
+#**   VRDA_SCALEDSHIFTED_LOGR(N,X,Y,B,A)
+# C equivalent*/
+#void vrda_scaledshifted_logr__(int *n, double *x, double *y, double *b, double *a)
+#{
+#	vrda_scaledshifted_logr(*n, x, y, *b, *a);
+#}
+.globl __vrda_scaledshifted_logr__
+	.type	__vrda_scaledshifted_logr__,@function
+__vrda_scaledshifted_logr__:
+	mov	(%rdi),%edi
+	movlpd	(%rcx),%xmm0
+	movlpd	(%r8),%xmm1
+
+# C interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# xmm0 - double b
+# xmm1 - double a
+
+	.align 16
+	.p2align 4,,15
+.globl vrda_scaledshifted_logr
+	.type	vrda_scaledshifted_logr,@function
+vrda_scaledshifted_logr:
+	sub	$stack_size,%rsp
+
+# save the arguments
+	mov	%rsi,save_xa(%rsp)	# save x_array pointer
+	mov	%rdx,save_ya(%rsp)	# save y_array pointer
+#ifdef INTEGER64
+	mov	%rdi,%rax
+#else
+	mov	%edi,%eax
+	mov	%rax,%rdi
+#endif
+# move the scale and shift factor to another register
+	movsd	%xmm0,%xmm10
+	unpcklpd	%xmm10,%xmm10
+	movsd	%xmm1,%xmm11
+	unpcklpd	%xmm11,%xmm11
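+
+# --- editor's note, not part of the original source: a minimal scalar C
+# sketch of what this routine computes, under the header's assumption
+# that inputs are finite, positive, and non-zero.  The function name is
+# hypothetical and log() is the libm routine:
+#
+#   #include <math.h>
+#   void scaledshifted_logr_ref(int n, const double *x, double *y,
+#                               double b, double a)
+#   {
+#       for (int i = 0; i < n; i++)
+#           y[i] = b * log(x[i]) + a;
+#   }
+#
+# The vector code below produces the same result two elements at a time,
+# replacing log() with the table-driven frcpa reduction described in the
+# header comment, then applying xmm10 = {b,b} and xmm11 = {a,a}.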
+
+	mov	%rdi,save_nv(%rsp)	# save number of values
+# see if too few values to call the main loop
+	shr	$2,%rax			# get number of iterations
+	jz	.L__vda_cleanup		# jump if only single calls
+# prepare the iteration counts
+	mov	%rax,p_iter(%rsp)	# save number of iterations
+	shl	$2,%rax
+	sub	%rax,%rdi		# compute number of extra single calls
+	mov	%rdi,save_nv(%rsp)	# save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+	mov	save_xa(%rsp),%rsi	# get x_array pointer
+	movlpd	(%rsi),%xmm0
+	movhpd	8(%rsi),%xmm0
+	prefetch	64(%rsi)
+	add	$32,%rsi
+	mov	%rsi,save_xa(%rsp)	# save x_array pointer
+
+	movlpd	-16(%rsi),%xmm1
+	movhpd	-8(%rsi),%xmm1
+
+# compute the logs
+
+#	movdqa	%xmm0,p_x(%rsp)	# save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
+	movdqa	%xmm0,%xmm8
+	movdqa	%xmm1,%xmm9
+
+	call	__vrd4_frcpa@PLT
+	movdqa	%xmm8,%xmm4
+	movdqa	%xmm9,%xmm7
+# invert the exponent
+	psllq	$1,%xmm8
+	psllq	$1,%xmm9
+	mulpd	%xmm0,%xmm4		# r
+	mulpd	%xmm1,%xmm7		# r
+	movdqa	%xmm8,%xmm5
+	paddq	.L__mask_rup(%rip),%xmm8
+	psrlq	$53,%xmm8
+	movdqa	%xmm9,%xmm6
+	paddq	.L__mask_rup(%rip),%xmm6
+	psrlq	$53,%xmm6
+	psubq	.L__mask_3ff(%rip),%xmm8
+	psubq	.L__mask_3ff(%rip),%xmm6
+	pshufd	$0x058,%xmm8,%xmm8
+	pshufd	$0x058,%xmm6,%xmm6
+
+	subpd	.L__real_one(%rip),%xmm4
+	subpd	.L__real_one(%rip),%xmm7
+
+	cvtdq2pd	%xmm8,%xmm0	#N
+	cvtdq2pd	%xmm6,%xmm1	#N
+#	movdqa	%xmm8,%xmm0
+#	movdqa	%xmm6,%xmm1
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
+	psrlq	$42,%xmm5
+	psrlq	$42,%xmm9
+	paddq	.L__int_one(%rip),%xmm5
+	paddq	.L__int_one(%rip),%xmm9
+	psrlq	$1,%xmm5
+	psrlq	$1,%xmm9
+	pand	.L__mask_3ff(%rip),%xmm5
+	pand	.L__mask_3ff(%rip),%xmm9
+	psllq	$1,%xmm5
+	psllq	$1,%xmm9
+
+	movdqa	%xmm5,p_x(%rsp)	# move the indexes to a memory location
+	movdqa	%xmm9,p_x2(%rsp)
+
+	movapd	.L__real_third(%rip),%xmm3
+	movdqa	%xmm3,%xmm5
+	movapd	%xmm4,%xmm2
+	movapd	%xmm7,%xmm8
+
+# approximation
+# compute the polynomial; only three series terms are used:
+# p(r) = -(1/2)r^2 + (1/3)r^3, so that r + p(r) ~= ln(1+r)
+
+	mulpd	%xmm4,%xmm2		#r^2
+	mulpd	%xmm7,%xmm8		#r^2
+
+	mulpd	%xmm4,%xmm3		# 1/3r
+	mulpd	%xmm7,%xmm5		# 1/3r
+# lookup the f(k) term
+	lea	.L__np_lnf_table(%rip),%rdx
+	mov	p_x(%rsp),%rcx
+	mov	p_x+8(%rsp),%r9
+	movlpd	(%rdx,%rcx,8),%xmm6	# lookup
+	movhpd	(%rdx,%r9,8),%xmm6	# lookup
+
+	addpd	.L__real_half(%rip),%xmm3	# p2 + p3r
+	addpd	.L__real_half(%rip),%xmm5	# p2 + p3r
+
+	mov	p_x2(%rsp),%rcx
+	mov	p_x2+8(%rsp),%r9
+	movlpd	(%rdx,%rcx,8),%xmm9	# lookup
+	movhpd	(%rdx,%r9,8),%xmm9	# lookup
+
+	mulpd	%xmm3,%xmm2		# r2(p2 + p3r)
+	mulpd	%xmm5,%xmm8		# r2(p2 + p3r)
+	addpd	%xmm4,%xmm2		# +r
+	addpd	%xmm7,%xmm8		# +r
+
+# reconstruction
+# compute ln(x) = T + r + p(r) where
+# T = N*ln(2) + ln(1/frcpa(x)) via table of ln(1/frcpa(y)), where y = 1 + k/256, 0 <= k <= 255
+
+	mulpd	.L__real_log2(%rip),%xmm0	# compute N*__real_log2
+	mulpd	.L__real_log2(%rip),%xmm1	# compute N*__real_log2
+	addpd	%xmm6,%xmm2		# add the new mantissas
+	addpd	%xmm9,%xmm8		# add the new mantissas
+	addpd	%xmm2,%xmm0
+	addpd	%xmm8,%xmm1
+
+# store the result _m128d
+	mov	save_ya(%rsp),%rdi	# get y_array pointer
+	mulpd	%xmm10,%xmm0
+	addpd	%xmm11,%xmm0
+	movlpd	%xmm0,(%rdi)
+	movhpd	%xmm0,8(%rdi)
+
+	prefetch	64(%rdi)
+	add	$32,%rdi
+	mov	%rdi,save_ya(%rsp)	# save y_array pointer
+
+# store the result _m128d
+	mulpd	%xmm10,%xmm1
+	addpd	%xmm11,%xmm1
+	movlpd	%xmm1,-16(%rdi)
+	movhpd	%xmm1,-8(%rdi)
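+
+# --- editor's note, not part of the original source: the reduction above
+# relies on the identity ln(x) = -ln(c) + ln(x*c), valid for any c > 0.
+# With c = frcpa(x) (a reciprocal approximation accurate to roughly
+# 2^-8), x*c = 1 + r with |r| on the order of 2^-8, so ln(1+r) is well
+# approximated by the three terms r - (1/2)r^2 + (1/3)r^3 computed above,
+# while -ln(c) splits into N*ln(2) plus a .L__np_lnf_table entry indexed
+# by the mantissa bits of c.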
+
+	mov	p_iter(%rsp),%rax	# get number of iterations
+	sub	$1,%rax
+	mov	%rax,p_iter(%rsp)	# save number of iterations
+	jnz	.L__vda_top
+
+# see if we need to do any extras
+	mov	save_nv(%rsp),%rax	# get number of values
+	test	%rax,%rax
+	jnz	.L__vda_cleanup
+
+.L__finish:
+	add	$stack_size,%rsp
+	ret
+
+	.align	16
+
+# we jump here when we have an odd number of log calls to make at the
+# end.  save_xa and save_ya point at the next x and y array elements,
+# and the number of values left is in save_nv.
+.L__vda_cleanup:
+	mov	save_nv(%rsp),%rax	# get number of values
+	test	%rax,%rax		# are there any values
+	jz	.L__finish		# exit if not
+
+	mov	save_xa(%rsp),%rsi
+	mov	save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+	xorpd	%xmm0,%xmm0
+	movlpd	%xmm0,p_x+8(%rsp)
+	movapd	%xmm0,p_x+16(%rsp)
+
+	mov	(%rsi),%rcx		# we know there's at least one
+	mov	%rcx,p_x(%rsp)
+	cmp	$2,%rax
+	jl	.L__vdacg
+
+	mov	8(%rsi),%rcx		# do the second value
+	mov	%rcx,p_x+8(%rsp)
+	cmp	$3,%rax
+	jl	.L__vdacg
+
+	mov	16(%rsi),%rcx		# do the third value
+	mov	%rcx,p_x+16(%rsp)
+
+.L__vdacg:
+	mov	$4,%rdi			# parameter for N
+	lea	p_x(%rsp),%rsi		# &x parameter
+	lea	p2_temp(%rsp),%rdx	# &y parameter
+	movsd	%xmm10,%xmm0
+	movsd	%xmm11,%xmm1
+	call	vrda_scaledshifted_logr@PLT	# call recursively to compute four values
+
+# now copy the results to the destination array
+	mov	save_ya(%rsp),%rdi
+	mov	save_nv(%rsp),%rax	# get number of values
+	mov	p2_temp(%rsp),%rcx
+	mov	%rcx,(%rdi)		# we know there's at least one
+	cmp	$2,%rax
+	jl	.L__vdacgf
+
+	mov	p2_temp+8(%rsp),%rcx
+	mov	%rcx,8(%rdi)		# do the second value
+	cmp	$3,%rax
+	jl	.L__vdacgf
+
+	mov	p2_temp+16(%rsp),%rcx
+	mov	%rcx,16(%rdi)		# do the third value
+
+.L__vdacgf:
+	jmp	.L__finish
+
+	.data
+	.align	64
+
+.L__real_one:	.quad	0x03ff0000000000000	# 1.0
+	.quad	0x03ff0000000000000
+
+.L__real_half:	.quad	0x0bfe0000000000000	# -1/2
+	.quad	0x0bfe0000000000000
+.L__real_third:	.quad	0x03fd5555555555555	# 1/3
+	.quad	0x03fd5555555555555
+.L__real_fourth:	.quad	0x0bfd0000000000000	# -1/4
+	.quad	0x0bfd0000000000000
+.L__real_fifth:	.quad	0x03fc999999999999a	# 1/5
+	.quad	0x03fc999999999999a
+.L__real_sixth:	.quad	0x0bfc5555555555555	# -1/6
+	.quad	0x0bfc5555555555555
+
+.L__real_log2:	.quad	0x03FE62E42FEFA39EF	# 0.693147182465
+	.quad	0x03FE62E42FEFA39EF
+
+.L__mask_3ff:	.quad	0x000000000000003ff	#
+	.quad	0x000000000000003ff
+
+.L__mask_rup:	.quad	0x0000003fffffffffe
+	.quad	0x0000003fffffffffe
+
+.L__int_one:	.quad	0x00000000000000001
+	.quad	0x00000000000000001
+
+.L__mask_10bits:	.quad	0x000000000000003ff
+	.quad	0x000000000000003ff
+
+.L__mask_expext:	.quad	0x000000000003ff000
+	.quad	0x000000000003ff000
+
+.L__mask_expext2:	.quad	0x000000000003ff800
+	.quad	0x000000000003ff800
+
+.L__np_lnf_table:
+#log table Program - logtab.c
+#Built Jan 18 2006 09:51:57
+#Compiler version 1400
+
+	.quad	0x00000000000000000	# 0.000000000000	0
+	.quad	0x00000000000000000
+	.quad	0x03F50020055655885	# 0.000977039648	1
+	.quad	0x03F50020055655885
+	.quad	0x03F60040155D5881E	# 0.001955034836	2
+	.quad	0x03F60040155D5881E
+	.quad	0x03F6809048289860A	# 0.002933987435	3
+	.quad	0x03F6809048289860A
+	.quad	0x03F70080559588B25	# 0.003913899321	4
+	.quad	0x03F70080559588B25
+	.quad	0x03F740C8A7478788D	# 0.004894772377	5
+	.quad	0x03F740C8A7478788D
+	.quad	0x03F78121214586B02	# 0.005876608489	6
+	.quad	0x03F78121214586B02
+	.quad	0x03F7C189CBB0E283F
# 0.006859409551 7 + .quad 0x03F7C189CBB0E283F + .quad 0x03F8010157588DE69 # 0.007843177461 8 + .quad 0x03F8010157588DE69 + .quad 0x03F82145E939EF1BC # 0.008827914124 9 + .quad 0x03F82145E939EF1BC + .quad 0x03F83D8896A83D7A8 # 0.009690354884 10 + .quad 0x03F83D8896A83D7A8 + .quad 0x03F85DDC705054DFF # 0.010676913110 11 + .quad 0x03F85DDC705054DFF + .quad 0x03F87E38762CA0C6D # 0.011664445593 12 + .quad 0x03F87E38762CA0C6D + .quad 0x03F89E9CAC6007563 # 0.012652954261 13 + .quad 0x03F89E9CAC6007563 + .quad 0x03F8BF091710935A4 # 0.013642441046 14 + .quad 0x03F8BF091710935A4 + .quad 0x03F8DF7DBA6777895 # 0.014632907884 15 + .quad 0x03F8DF7DBA6777895 + .quad 0x03F8FBEA8B13C03F9 # 0.015500371846 16 + .quad 0x03F8FBEA8B13C03F9 + .quad 0x03F90E3751F24F45C # 0.016492681528 17 + .quad 0x03F90E3751F24F45C + .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18 + .quad 0x03F91E7D80B1FBF4C + .quad 0x03F92CBE4F6CC56C3 # 0.018355920375 19 + .quad 0x03F92CBE4F6CC56C3 + .quad 0x03F93D0C443D7258C # 0.019351069108 20 + .quad 0x03F93D0C443D7258C + .quad 0x03F94D5E6176ACC89 # 0.020347209148 21 + .quad 0x03F94D5E6176ACC89 + .quad 0x03F95DB4A937DEF10 # 0.021344342472 22 + .quad 0x03F95DB4A937DEF10 + .quad 0x03F96C039490E37F4 # 0.022217650494 23 + .quad 0x03F96C039490E37F4 + .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24 + .quad 0x03F97C61B1CF5DED7 + .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25 + .quad 0x03F98AB77B3FD6EAD + .quad 0x03F99B1D75828E780 # 0.025092472797 26 + .quad 0x03F99B1D75828E780 + .quad 0x03F9AB87A478CB7CB # 0.026094351403 27 + .quad 0x03F9AB87A478CB7CB + .quad 0x03F9B9E8027E1916F # 0.026971819338 28 + .quad 0x03F9B9E8027E1916F + .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29 + .quad 0x03F9CA5A1A18613E6 + .quad 0x03F9D8C1670325921 # 0.028854704473 30 + .quad 0x03F9D8C1670325921 + .quad 0x03F9E93B6EE41F674 # 0.029860361378 31 + .quad 0x03F9E93B6EE41F674 + .quad 0x03F9F7A9B16782855 # 0.030741141554 32 + .quad 0x03F9F7A9B16782855 + .quad 0x03FA0415D89E74440 # 0.031748698315 33 + .quad 0x03FA0415D89E74440 + .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34 + .quad 0x03FA0C58FA19DFAAB + .quad 0x03FA139577CC41C1A # 0.033640607815 35 + .quad 0x03FA139577CC41C1A + .quad 0x03FA1AD398C6CD57C # 0.034524725334 36 + .quad 0x03FA1AD398C6CD57C + .quad 0x03FA231C9C40E204E # 0.035536103423 37 + .quad 0x03FA231C9C40E204E + .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38 + .quad 0x03FA2A5E4231CF7BD + .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39 + .quad 0x03FA32AB4D4C59CB0 + .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40 + .quad 0x03FA39F07BA0EBD5A + .quad 0x03FA424192495D571 # 0.039337907520 41 + .quad 0x03FA424192495D571 + .quad 0x03FA498A4C73DA65D # 0.040227078744 42 + .quad 0x03FA498A4C73DA65D + .quad 0x03FA50D4AF75CA86F # 0.041117041297 43 + .quad 0x03FA50D4AF75CA86F + .quad 0x03FA592BBC15215BC # 0.042135112141 44 + .quad 0x03FA592BBC15215BC + .quad 0x03FA6079B00423FF6 # 0.043026775152 45 + .quad 0x03FA6079B00423FF6 + .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46 + .quad 0x03FA67C94F2D4BB65 + .quad 0x03FA70265A550E77B # 0.044940163069 47 + .quad 0x03FA70265A550E77B + .quad 0x03FA77798F8D6DFDC # 0.045834331871 48 + .quad 0x03FA77798F8D6DFDC + .quad 0x03FA7ECE7267CD123 # 0.046729300926 49 + .quad 0x03FA7ECE7267CD123 + .quad 0x03FA873184BC09586 # 0.047753104446 50 + .quad 0x03FA873184BC09586 + .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51 + .quad 0x03FA8E8A02D2E3175 + .quad 0x03FA95E430F8CE456 # 0.049547286652 52 + .quad 0x03FA95E430F8CE456 + .quad 0x03FA9D400FF482586 # 0.050445586359 53 + .quad 0x03FA9D400FF482586 + .quad 
0x03FAA5AB21CB34A9E # 0.051473203662 54 + .quad 0x03FAA5AB21CB34A9E + .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55 + .quad 0x03FAAD0AA2E784EF4 + .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56 + .quad 0x03FAB46BD74DA76A0 + .quad 0x03FABBCEBFC68F424 # 0.054175734102 57 + .quad 0x03FABBCEBFC68F424 + .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58 + .quad 0x03FAC3335D1BBAE4D + .quad 0x03FACBA87200EB8F1 # 0.056110594428 59 + .quad 0x03FACBA87200EB8F1 + .quad 0x03FAD310BA20455A2 # 0.057014812019 60 + .quad 0x03FAD310BA20455A2 + .quad 0x03FADA7AB998B77ED # 0.057919847959 61 + .quad 0x03FADA7AB998B77ED + .quad 0x03FAE1E6713606CFB # 0.058825703731 62 + .quad 0x03FAE1E6713606CFB + .quad 0x03FAE953E1C48603A # 0.059732380822 63 + .quad 0x03FAE953E1C48603A + .quad 0x03FAF0C30C1116351 # 0.060639880722 64 + .quad 0x03FAF0C30C1116351 + .quad 0x03FAF833F0E927711 # 0.061548204926 65 + .quad 0x03FAF833F0E927711 + .quad 0x03FAFFA6911AB9309 # 0.062457354934 66 + .quad 0x03FAFFA6911AB9309 + .quad 0x03FB038D76BA2D737 # 0.063367332247 67 + .quad 0x03FB038D76BA2D737 + .quad 0x03FB0748836296412 # 0.064278138373 68 + .quad 0x03FB0748836296412 + .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69 + .quad 0x03FB0B046EEE6F7A4 + .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70 + .quad 0x03FB0EC139C5DA5FD + .quad 0x03FB127EE451413A8 # 0.067015544762 71 + .quad 0x03FB127EE451413A8 + .quad 0x03FB163D6EF9579FC # 0.067929681294 72 + .quad 0x03FB163D6EF9579FC + .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73 + .quad 0x03FB19FCDA271ABC0 + .quad 0x03FB1DBD2643D1912 # 0.069760465119 74 + .quad 0x03FB1DBD2643D1912 + .quad 0x03FB217E53B90D3CE # 0.070677115481 75 + .quad 0x03FB217E53B90D3CE + .quad 0x03FB254062F0A9417 # 0.071594606862 76 + .quad 0x03FB254062F0A9417 + .quad 0x03FB29035454CBCB0 # 0.072512940806 77 + .quad 0x03FB29035454CBCB0 + .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78 + .quad 0x03FB2CC7284FE5F1A + .quad 0x03FB308BDF4CB4062 # 0.074352142586 79 + .quad 0x03FB308BDF4CB4062 + .quad 0x03FB345179B63DD3F # 0.075273013532 80 + .quad 0x03FB345179B63DD3F + .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81 + .quad 0x03FB3817F7F7D6EAB + .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82 + .quad 0x03FB3BDF5A7D1EE5E + .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83 + .quad 0x03FB3F1D405CE86D3 + .quad 0x03FB42E64BEC266E4 # 0.078832909176 84 + .quad 0x03FB42E64BEC266E4 + .quad 0x03FB46B03CF437BC4 # 0.079757917501 85 + .quad 0x03FB46B03CF437BC4 + .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86 + .quad 0x03FB4A7B13E1E3E65 + .quad 0x03FB4E46D1223FE84 # 0.081610505036 87 + .quad 0x03FB4E46D1223FE84 + .quad 0x03FB52137522AE732 # 0.082538087426 88 + .quad 0x03FB52137522AE732 + .quad 0x03FB5555DE434F2A0 # 0.083333843436 89 + .quad 0x03FB5555DE434F2A0 + .quad 0x03FB59242FF043D34 # 0.084263026485 90 + .quad 0x03FB59242FF043D34 + .quad 0x03FB5CF36997817B2 # 0.085193073719 91 + .quad 0x03FB5CF36997817B2 + .quad 0x03FB60C38BA799459 # 0.086123986746 92 + .quad 0x03FB60C38BA799459 + .quad 0x03FB6408F471C82A2 # 0.086922602521 93 + .quad 0x03FB6408F471C82A2 + .quad 0x03FB67DAC7466CB96 # 0.087855127734 94 + .quad 0x03FB67DAC7466CB96 + .quad 0x03FB6BAD83C1883BA # 0.088788523361 95 + .quad 0x03FB6BAD83C1883BA + .quad 0x03FB6EF528C056A2D # 0.089589270768 96 + .quad 0x03FB6EF528C056A2D + .quad 0x03FB72C9985035BB1 # 0.090524287199 97 + .quad 0x03FB72C9985035BB1 + .quad 0x03FB769EF2C6B5688 # 0.091460178704 98 + .quad 0x03FB769EF2C6B5688 + .quad 0x03FB79E8D70A364C6 # 0.092263069152 99 + .quad 0x03FB79E8D70A364C6 + .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100 + .quad 
0x03FB7DBFE6EA733FE
+ .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101
+ .quad 0x03FB8197E2F40E3F0
+ .quad 0x03FB84E40992A4804 # 0.094944035906 102
+ .quad 0x03FB84E40992A4804
+ .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103
+ .quad 0x03FB88BDBD5FC66D2
+ .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104
+ .quad 0x03FB8C985E9B9EC7E
+ .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105
+ .quad 0x03FB8FE6CAB20E979
+ .quad 0x03FB93C3261014C65 # 0.098574780162 106
+ .quad 0x03FB93C3261014C65
+ .quad 0x03FB97130DC9235DE # 0.099383405543 107
+ .quad 0x03FB97130DC9235DE
+ .quad 0x03FB9AF124D64C623 # 0.100327628989 108
+ .quad 0x03FB9AF124D64C623
+ .quad 0x03FB9E4289871E964 # 0.101137673586 109
+ .quad 0x03FB9E4289871E964
+ .quad 0x03FBA2225DD276FCB # 0.102083555691 110
+ .quad 0x03FBA2225DD276FCB
+ .quad 0x03FBA57540D1FE441 # 0.102895024494 111
+ .quad 0x03FBA57540D1FE441
+ .quad 0x03FBA956D3ECADE60 # 0.103842571097 112
+ .quad 0x03FBA956D3ECADE60
+ .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113
+ .quad 0x03FBACAB3693AB9C0
+ .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114
+ .quad 0x03FBB08E8A10F96F4
+ .quad 0x03FBB3E46DBA02181 # 0.106419018383 115
+ .quad 0x03FBB3E46DBA02181
+ .quad 0x03FBB7C9832F58018 # 0.107369911615 116
+ .quad 0x03FBB7C9832F58018
+ .quad 0x03FBBB20E936D6976 # 0.108185683244 117
+ .quad 0x03FBBB20E936D6976
+ .quad 0x03FBBF07C23BC54EA # 0.109138258671 118
+ .quad 0x03FBBF07C23BC54EA
+ .quad 0x03FBC260ABFFFE972 # 0.109955474734 119
+ .quad 0x03FBC260ABFFFE972
+ .quad 0x03FBC6494A2E418A0 # 0.110909738320 120
+ .quad 0x03FBC6494A2E418A0
+ .quad 0x03FBC9A3B90F57748 # 0.111728403941 121
+ .quad 0x03FBC9A3B90F57748
+ .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122
+ .quad 0x03FBCCFEDBFEE13A8
+ .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123
+ .quad 0x03FBD0EA1362CDBFC
+ .quad 0x03FBD446BD753D433 # 0.114325275488 124
+ .quad 0x03FBD446BD753D433
+ .quad 0x03FBD7A41C8627307 # 0.115146743223 125
+ .quad 0x03FBD7A41C8627307
+ .quad 0x03FBDB91F09680DF9 # 0.116105975911 126
+ .quad 0x03FBDB91F09680DF9
+ .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127
+ .quad 0x03FBDEF0D8D466DBB
+ .quad 0x03FBE2507702AF03B # 0.117752518544 128
+ .quad 0x03FBE2507702AF03B
+ .quad 0x03FBE640EB3D2B411 # 0.118714255240 129
+ .quad 0x03FBE640EB3D2B411
+ .quad 0x03FBE9A214A69DD58 # 0.119539337795 130
+ .quad 0x03FBE9A214A69DD58
+ .quad 0x03FBED03F4F440969 # 0.120365101673 131
+ .quad 0x03FBED03F4F440969
+ .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132
+ .quad 0x03FBF0F70CDD992E4
+ .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133
+ .quad 0x03FBF45A7A78B7C3B
+ .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134
+ .quad 0x03FBF7BE9FEDBFDED
+ .quad 0x03FBFB237D8AB13FB # 0.123813143156 135
+ .quad 0x03FBFB237D8AB13FB
+ .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136
+ .quad 0x03FBFF1A13EAC95FD
+ .quad 0x03FC014040CAB0229 # 0.125610834299 137
+ .quad 0x03FC014040CAB0229
+ .quad 0x03FC02F3D4301417B # 0.126441629140 138
+ .quad 0x03FC02F3D4301417B
+ .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139
+ .quad 0x03FC04A7C44CF87A4
+ .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140
+ .quad 0x03FC06A4D1D26C5E9
+ .quad 0x03FC08598B59E3A07 # 0.129077042275 141
+ .quad 0x03FC08598B59E3A07
+ .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142
+ .quad 0x03FC0A0EA2164AF02
+ .quad 0x03FC0BC4162F73B66 # 0.130745099376 143
+ .quad 0x03FC0BC4162F73B66
+ .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144
+ .quad 0x03FC0D79E7CD48E58
+ .quad 0x03FC0F301717CF0FB # 0.132415943541 145
+ .quad 0x03FC0F301717CF0FB
+ .quad 0x03FC10E6A437247B7 # 0.133252413686 146
+ .quad 0x03FC10E6A437247B7
+ .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147
+ .quad 0x03FC12E6BFA8FEAD6
+ .quad 0x03FC149E189F8642E # 0.135067169541 148
+ .quad 0x03FC149E189F8642E
+ .quad 0x03FC1655CFEA923A4 # 0.135905861231 149
+ .quad 0x03FC1655CFEA923A4
+ .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150
+ .quad 0x03FC180DE5B2ACE5C
+ .quad 0x03FC19C65A207AC07 # 0.137585357777 151
+ .quad 0x03FC19C65A207AC07
+ .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152
+ .quad 0x03FC1B7F2D5CBA842
+ .quad 0x03FC1D385F90453F2 # 0.139267679777 153
+ .quad 0x03FC1D385F90453F2
+ .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154
+ .quad 0x03FC1EF1F0E40E6CD
+ .quad 0x03FC20ABE18124098 # 0.140952836755 155
+ .quad 0x03FC20ABE18124098
+ .quad 0x03FC22663190AEACC # 0.141796481350 156
+ .quad 0x03FC22663190AEACC
+ .quad 0x03FC2420E13BF19E3 # 0.142640838281 157
+ .quad 0x03FC2420E13BF19E3
+ .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158
+ .quad 0x03FC25DBF0AC4AED2
+ .quad 0x03FC2797600B3387B # 0.144331693975 159
+ .quad 0x03FC2797600B3387B
+ .quad 0x03FC29532F823F525 # 0.145178195155 160
+ .quad 0x03FC29532F823F525
+ .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161
+ .quad 0x03FC2B0F5F3B1D3EF
+ .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162
+ .quad 0x03FC2CCBEF5F97653
+ .quad 0x03FC2E88E01993187 # 0.147722006588 163
+ .quad 0x03FC2E88E01993187
+ .quad 0x03FC3046319311009 # 0.148571383763 164
+ .quad 0x03FC3046319311009
+ .quad 0x03FC3203E3F62D328 # 0.149421482992 165
+ .quad 0x03FC3203E3F62D328
+ .quad 0x03FC33C1F76D1F469 # 0.150272305505 166
+ .quad 0x03FC33C1F76D1F469
+ .quad 0x03FC35806C223A70F # 0.151123852534 167
+ .quad 0x03FC35806C223A70F
+ .quad 0x03FC373F423FED9A1 # 0.151976125313 168
+ .quad 0x03FC373F423FED9A1
+ .quad 0x03FC38FE79F0C3771 # 0.152829125080 169
+ .quad 0x03FC38FE79F0C3771
+ .quad 0x03FC3ABE135F62A12 # 0.153682853077 170
+ .quad 0x03FC3ABE135F62A12
+ .quad 0x03FC3C335E0447D71 # 0.154394850259 171
+ .quad 0x03FC3C335E0447D71
+ .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172
+ .quad 0x03FC3DF3AB13505F9
+ .quad 0x03FC3FB45A59928CA # 0.156105714663 173
+ .quad 0x03FC3FB45A59928CA
+ .quad 0x03FC41756C0220C81 # 0.156962245765 174
+ .quad 0x03FC41756C0220C81
+ .quad 0x03FC4336E03829D61 # 0.157819511141 175
+ .quad 0x03FC4336E03829D61
+ .quad 0x03FC44F8B726F8EFE # 0.158677512051 176
+ .quad 0x03FC44F8B726F8EFE
+ .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177
+ .quad 0x03FC46BAF0F9F5DB8
+ .quad 0x03FC48326CD3EC797 # 0.160252428262 178
+ .quad 0x03FC48326CD3EC797
+ .quad 0x03FC49F55C6502F81 # 0.161112520058 179
+ .quad 0x03FC49F55C6502F81
+ .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180
+ .quad 0x03FC4BB8AF55DE908
+ .quad 0x03FC4D7C65D25566D # 0.162834926111 181
+ .quad 0x03FC4D7C65D25566D
+ .quad 0x03FC4F4080065AA7F # 0.163697242922 182
+ .quad 0x03FC4F4080065AA7F
+ .quad 0x03FC50B98CD30A759 # 0.164416408720 183
+ .quad 0x03FC50B98CD30A759
+ .quad 0x03FC527E5E4A1B58D # 0.165280090939 184
+ .quad 0x03FC527E5E4A1B58D
+ .quad 0x03FC544393F5DF80F # 0.166144519750 185
+ .quad 0x03FC544393F5DF80F
+ .quad 0x03FC56092E02BA514 # 0.167009696444 186
+ .quad 0x03FC56092E02BA514
+ .quad 0x03FC57837B3098F2C # 0.167731249257 187
+ .quad 0x03FC57837B3098F2C
+ .quad 0x03FC5949CDB873419 # 0.168597800437 188
+ .quad 0x03FC5949CDB873419
+ .quad 0x03FC5B10851FC924A # 0.169465103180 189
+ .quad 0x03FC5B10851FC924A
+ .quad 0x03FC5C8BC079D8289 # 0.170188430518 190
+ .quad 0x03FC5C8BC079D8289
+ .quad 0x03FC5E533144C1718 # 0.171057114516 191
+ .quad 0x03FC5E533144C1718
+ .quad 0x03FC601B076E7A8A8 # 0.171926553783 192
+ .quad 0x03FC601B076E7A8A8
+ .quad 0x03FC619732215D786 # 0.172651664394 193
+ .quad 0x03FC619732215D786
+ .quad 0x03FC635FC298F6C77 # 0.173522491735 194
+ .quad 0x03FC635FC298F6C77
+ .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195
+ .quad 0x03FC6528B8EFA5D16
+ .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196
+ .quad 0x03FC66A5D42A3AD33
+ .quad 0x03FC686F85BAD4298 # 0.175993962063 197
+ .quad 0x03FC686F85BAD4298
+ .quad 0x03FC6A399DABBD383 # 0.176867706111 198
+ .quad 0x03FC6A399DABBD383
+ .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199
+ .quad 0x03FC6BB7AA9F22C40
+ .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200
+ .quad 0x03FC6D827EB7C1E57
+ .quad 0x03FC6F0128B756AB9 # 0.179201429458 201
+ .quad 0x03FC6F0128B756AB9
+ .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202
+ .quad 0x03FC70CCB9927BCF6
+ .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203
+ .quad 0x03FC7298B1A4E32B6
+ .quad 0x03FC74184F58CC7DC # 0.181686992547 204
+ .quad 0x03FC74184F58CC7DC
+ .quad 0x03FC75E5051E74141 # 0.182565727226 205
+ .quad 0x03FC75E5051E74141
+ .quad 0x03FC77654128F6127 # 0.183298596442 206
+ .quad 0x03FC77654128F6127
+ .quad 0x03FC7932B53E97639 # 0.184178749058 207
+ .quad 0x03FC7932B53E97639
+ .quad 0x03FC7AB390229D8FD # 0.184912801796 208
+ .quad 0x03FC7AB390229D8FD
+ .quad 0x03FC7C81C325B4A5E # 0.185794376934 209
+ .quad 0x03FC7C81C325B4A5E
+ .quad 0x03FC7E033D66CD24A # 0.186529617023 210
+ .quad 0x03FC7E033D66CD24A
+ .quad 0x03FC7FD22FF599D4C # 0.187412619288 211
+ .quad 0x03FC7FD22FF599D4C
+ .quad 0x03FC81544A17F67C1 # 0.188149050576 212
+ .quad 0x03FC81544A17F67C1
+ .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213
+ .quad 0x03FC8323FCD17DAC8
+ .quad 0x03FC84A6B759F512D # 0.189771110947 214
+ .quad 0x03FC84A6B759F512D
+ .quad 0x03FC86772ADE0201C # 0.190656981373 215
+ .quad 0x03FC86772ADE0201C
+ .quad 0x03FC87FA865210911 # 0.191395806674 216
+ .quad 0x03FC87FA865210911
+ .quad 0x03FC89CBBB4136201 # 0.192283118179 217
+ .quad 0x03FC89CBBB4136201
+ .quad 0x03FC8B4FB826FF291 # 0.193023146334 218
+ .quad 0x03FC8B4FB826FF291
+ .quad 0x03FC8D21AF2299298 # 0.193911903613 219
+ .quad 0x03FC8D21AF2299298
+ .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220
+ .quad 0x03FC8EA64E00E7FC0
+ .quad 0x03FC902B36AB7681D # 0.195394923313 221
+ .quad 0x03FC902B36AB7681D
+ .quad 0x03FC91FE49096581E # 0.196285791969 222
+ .quad 0x03FC91FE49096581E
+ .quad 0x03FC9383D471B869B # 0.197028789254 223
+ .quad 0x03FC9383D471B869B
+ .quad 0x03FC9557AA6B87F65 # 0.197921115309 224
+ .quad 0x03FC9557AA6B87F65
+ .quad 0x03FC96DDD91A0B959 # 0.198665329082 225
+ .quad 0x03FC96DDD91A0B959
+ .quad 0x03FC9864522D04491 # 0.199410097121 226
+ .quad 0x03FC9864522D04491
+ .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227
+ .quad 0x03FC9A3945D1A44B3
+ .quad 0x03FC9BC062F26FC3B # 0.201050541900 228
+ .quad 0x03FC9BC062F26FC3B
+ .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229
+ .quad 0x03FC9D47CAD2C1871
+ .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230
+ .quad 0x03FC9F1DDD7FE4F8B
+ .quad 0x03FCA0A5EA371A910 # 0.203441457564 231
+ .quad 0x03FCA0A5EA371A910
+ .quad 0x03FCA22E42098F498 # 0.204189792554 232
+ .quad 0x03FCA22E42098F498
+ .quad 0x03FCA405751F6CCE4 # 0.205088534376 233
+ .quad 0x03FCA405751F6CCE4
+ .quad 0x03FCA58E729348F40 # 0.205838103409 234
+ .quad 0x03FCA58E729348F40
+ .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235
+ .quad 0x03FCA717BB7EC64A3
+ .quad 0x03FCA8F010601E5FD # 0.207489135679 236
+ .quad 0x03FCA8F010601E5FD
+ .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237
+ .quad 0x03FCAA79FFB8FCD48
+ .quad 0x03FCAC043AE68965A # 0.208992443238 238
+ .quad 0x03FCAC043AE68965A
+ .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239
+ .quad 0x03FCAD8EC205FB6AD
+ .quad 0x03FCAF6895610DBAD # 0.210648695969 240
+ .quad 0x03FCAF6895610DBAD
+ .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241
+ .quad 0x03FCB0F3C3FBD65C9
+ .quad 0x03FCB27F3EE674219 # 0.212156764419 242
+ .quad 0x03FCB27F3EE674219
+ .quad 0x03FCB40B063E65B0F # 0.212911652354 243
+ .quad 0x03FCB40B063E65B0F
+ .quad 0x03FCB5E65A8096C88 # 0.213818270730 244
+ .quad 0x03FCB5E65A8096C88
+ .quad 0x03FCB772CA646760C # 0.214574414434 245
+ .quad 0x03FCB772CA646760C
+ .quad 0x03FCB8FF871461198 # 0.215331130323 246
+ .quad 0x03FCB8FF871461198
+ .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247
+ .quad 0x03FCBA8C90AE4AD19
+ .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248
+ .quad 0x03FCBC19E74FFCBDA
+ .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249
+ .quad 0x03FCBDF71B83DAE7A
+ .quad 0x03FCBF851C067555C # 0.218515604922 250
+ .quad 0x03FCBF851C067555C
+ .quad 0x03FCC11369F0CDB3C # 0.219275310193 251
+ .quad 0x03FCC11369F0CDB3C
+ .quad 0x03FCC2A205610593E # 0.220035593055 252
+ .quad 0x03FCC2A205610593E
+ .quad 0x03FCC430EE755023B # 0.220796454387 253
+ .quad 0x03FCC430EE755023B
+ .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254
+ .quad 0x03FCC5C0254BF23A8
+ .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255
+ .quad 0x03FCC79F9AB632BF1
+ .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256
+ .quad 0x03FCC92F7D09ABE20
+ .quad 0x03FCCABFAD80D023D # 0.223998408788 257
+ .quad 0x03FCCABFAD80D023D
+ .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258
+ .quad 0x03FCCC502C3A2F1E8
+ .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259
+ .quad 0x03FCCDE0F9546A5E7
+ .quad 0x03FCCF7214EE356E9 # 0.226291812439 260
+ .quad 0x03FCCF7214EE356E9
+ .quad 0x03FCD1037F2655E7B # 0.227057450635 261
+ .quad 0x03FCD1037F2655E7B
+ .quad 0x03FCD295381BA37E9 # 0.227823675483 262
+ .quad 0x03FCD295381BA37E9
+ .quad 0x03FCD4273FED08111 # 0.228590487882 263
+ .quad 0x03FCD4273FED08111
+ .quad 0x03FCD5B996B97FB5F # 0.229357888733 264
+ .quad 0x03FCD5B996B97FB5F
+ .quad 0x03FCD74C3CA018C9C # 0.230125878940 265
+ .quad 0x03FCD74C3CA018C9C
+ .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266
+ .quad 0x03FCD8DF31BFF3FF2
+ .quad 0x03FCDA727638446A1 # 0.231663631050 267
+ .quad 0x03FCDA727638446A1
+ .quad 0x03FCDC56CAE452F5B # 0.232587418645 268
+ .quad 0x03FCDC56CAE452F5B
+ .quad 0x03FCDDEABE5A3926E # 0.233357894066 269
+ .quad 0x03FCDDEABE5A3926E
+ .quad 0x03FCDF7F018CE771F # 0.234128963578 270
+ .quad 0x03FCDF7F018CE771F
+ .quad 0x03FCE113949BDEC62 # 0.234900628096 271
+ .quad 0x03FCE113949BDEC62
+ .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272
+ .quad 0x03FCE2A877A6B2C0F
+ .quad 0x03FCE43DAACD09BEC # 0.236445745833 273
+ .quad 0x03FCE43DAACD09BEC
+ .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274
+ .quad 0x03FCE5D32E2E9CE87
+ .quad 0x03FCE76901EB38427 # 0.237993254653 275
+ .quad 0x03FCE76901EB38427
+ .quad 0x03FCE8ADE53F76866 # 0.238612929343 276
+ .quad 0x03FCE8ADE53F76866
+ .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277
+ .quad 0x03FCEA4449F04AAF4
+ .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278
+ .quad 0x03FCEBDAFF5593E99
+ .quad 0x03FCED72058F666C5 # 0.240940135421 279
+ .quad 0x03FCED72058F666C5
+ .quad 0x03FCEF095CBDE9937 # 0.241717075868 280
+ .quad 0x03FCEF095CBDE9937
+ .quad 0x03FCF0A1050157ED6 # 0.242494620422 281
+ .quad 0x03FCF0A1050157ED6
+ .quad 0x03FCF238FE79FF4BF # 0.243272770021 282
+ .quad 0x03FCF238FE79FF4BF
+ .quad 0x03FCF3D1494840D2F # 0.244051525609 283
+ .quad 0x03FCF3D1494840D2F
+ .quad 0x03FCF569E58C91077 # 0.244830888130 284
+ .quad 0x03FCF569E58C91077
+ .quad 0x03FCF702D36777DF0 # 0.245610858531 285
+ .quad 0x03FCF702D36777DF0
+ .quad 0x03FCF89C12F990D0C # 0.246391437760 286
+ .quad 0x03FCF89C12F990D0C
+ .quad 0x03FCFA35A4638AE2C # 0.247172626770 287
+ .quad 0x03FCFA35A4638AE2C
+ .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288
+ .quad 0x03FCFB7D86EEE3B92
+ .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289
+ .quad 0x03FCFD17ABFCDB683
+ .quad 0x03FCFEB2233EA07CB # 0.249363208150 290
+ .quad 0x03FCFEB2233EA07CB
+ .quad 0x03FD0026766A9671C # 0.250146723037 291
+ .quad 0x03FD0026766A9671C
+ .quad 0x03FD00F40470C7323 # 0.250930852302 292
+ .quad 0x03FD00F40470C7323
+ .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293
+ .quad 0x03FD01C1BBC2735A3
+ .quad 0x03FD028F9C7035C1D # 0.252500957822 294
+ .quad 0x03FD028F9C7035C1D
+ .quad 0x03FD03346E0106062 # 0.253129690945 295
+ .quad 0x03FD03346E0106062
+ .quad 0x03FD0402994B4F041 # 0.253916163656 296
+ .quad 0x03FD0402994B4F041
+ .quad 0x03FD04D0EE20620AF # 0.254703255393 297
+ .quad 0x03FD04D0EE20620AF
+ .quad 0x03FD059F6C910034D # 0.255490967131 298
+ .quad 0x03FD059F6C910034D
+ .quad 0x03FD066E14ADF4BFD # 0.256279299848 299
+ .quad 0x03FD066E14ADF4BFD
+ .quad 0x03FD07138604D5864 # 0.256910413785 300
+ .quad 0x03FD07138604D5864
+ .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301
+ .quad 0x03FD07E2794F3E8C1
+ .quad 0x03FD08B196753A125 # 0.258489943414 302
+ .quad 0x03FD08B196753A125
+ .quad 0x03FD0980DD87BA2DD # 0.259280644807 303
+ .quad 0x03FD0980DD87BA2DD
+ .quad 0x03FD0A504E97BB40C # 0.260071971904 304
+ .quad 0x03FD0A504E97BB40C
+ .quad 0x03FD0AF660EB9E278 # 0.260705484754 305
+ .quad 0x03FD0AF660EB9E278
+ .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306
+ .quad 0x03FD0BC61DBBA97CB
+ .quad 0x03FD0C9604B8FC51E # 0.262291024962 307
+ .quad 0x03FD0C9604B8FC51E
+ .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308
+ .quad 0x03FD0D3C7586CD5E5
+ .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309
+ .quad 0x03FD0E0CA89A72D29
+ .quad 0x03FD0EDD060B78082 # 0.264515013170 310
+ .quad 0x03FD0EDD060B78082
+ .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311
+ .quad 0x03FD0FAD8DEB1E2C0
+ .quad 0x03FD10547F9D26ABC # 0.265947336165 312
+ .quad 0x03FD10547F9D26ABC
+ .quad 0x03FD1125540925114 # 0.266743958529 313
+ .quad 0x03FD1125540925114
+ .quad 0x03FD11F653144CB8B # 0.267541216005 314
+ .quad 0x03FD11F653144CB8B
+ .quad 0x03FD129DA43F5BE9E # 0.268179479949 315
+ .quad 0x03FD129DA43F5BE9E
+ .quad 0x03FD136EF02E8290C # 0.268977883185 316
+ .quad 0x03FD136EF02E8290C
+ .quad 0x03FD144066EDAE406 # 0.269776924378 317
+ .quad 0x03FD144066EDAE406
+ .quad 0x03FD14E817FF359D7 # 0.270416617347 318
+ .quad 0x03FD14E817FF359D7
+ .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319
+ .quad 0x03FD15B9DBFA9DEC8
+ .quad 0x03FD168BCAF73B3EB # 0.272017642345 320
+ .quad 0x03FD168BCAF73B3EB
+ .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321
+ .quad 0x03FD1733DC5D68DE8
+ .quad 0x03FD180618EF18ADE # 0.273460759729 322
+ .quad 0x03FD180618EF18ADE
+ .quad 0x03FD18D880B3826FE # 0.274263392407 323
+ .quad 0x03FD18D880B3826FE
+ .quad 0x03FD1980F2DD42B6F # 0.274905962710 324
+ .quad 0x03FD1980F2DD42B6F
+ .quad 0x03FD1A53A8902E70B # 0.275709756661 325
+ .quad 0x03FD1A53A8902E70B
+ .quad 0x03FD1AFC59297024D # 0.276353257326 326
+ .quad 0x03FD1AFC59297024D
+ .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327
+ .quad 0x03FD1BCF5D04AE1EA
+ .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328
+ .quad 0x03FD1CA28C64BAE54
+ .quad 0x03FD1D4B9E796C245 # 0.278608776246 329
+ .quad 0x03FD1D4B9E796C245
+ .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330
+ .quad 0x03FD1E1F1C5C3A06C
+ .quad 0x03FD1EC86D5747AAD # 0.280061443760 331
+ .quad 0x03FD1EC86D5747AAD
+ .quad 0x03FD1F9C39F74C559 # 0.280869394034 332
+ .quad 0x03FD1F9C39F74C559
+ .quad 0x03FD2070326F1F789 # 0.281677997620 333
+ .quad 0x03FD2070326F1F789
+ .quad 0x03FD2119E59F8789C # 0.282325351583 334
+ .quad 0x03FD2119E59F8789C
+ .quad 0x03FD21EE2D300381C # 0.283135133796 335
+ .quad 0x03FD21EE2D300381C
+ .quad 0x03FD22981FBEF797A # 0.283783432036 336
+ .quad 0x03FD22981FBEF797A
+ .quad 0x03FD236CB6A339EED # 0.284594396317 337
+ .quad 0x03FD236CB6A339EED
+ .quad 0x03FD2416E8C01F606 # 0.285243641592 338
+ .quad 0x03FD2416E8C01F606
+ .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339
+ .quad 0x03FD24EBCF3387FF6
+ .quad 0x03FD2596410DF963A # 0.286705986479 340
+ .quad 0x03FD2596410DF963A
+ .quad 0x03FD266B774C2AF55 # 0.287519325279 341
+ .quad 0x03FD266B774C2AF55
+ .quad 0x03FD27162913F873F # 0.288170472950 342
+ .quad 0x03FD27162913F873F
+ .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343
+ .quad 0x03FD27EBAF58D8C9C
+ .quad 0x03FD2896A13E086A3 # 0.289637107288 344
+ .quad 0x03FD2896A13E086A3
+ .quad 0x03FD296C77C5C0E13 # 0.290452834554 345
+ .quad 0x03FD296C77C5C0E13
+ .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346
+ .quad 0x03FD2A17A9F88EDD2
+ .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347
+ .quad 0x03FD2AEDD0FF8CC2C
+ .quad 0x03FD2B9943B06BD77 # 0.292576844829 348
+ .quad 0x03FD2B9943B06BD77
+ .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349
+ .quad 0x03FD2C6FBB7360D0E
+ .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350
+ .quad 0x03FD2D1B6ED2FA90C
+ .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351
+ .quad 0x03FD2DC73F01B0DD4
+ .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352
+ .quad 0x03FD2E9E2BCE12286
+ .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353
+ .quad 0x03FD2F4A3CF22EDC2
+ .quad 0x03FD30217B1006601 # 0.297002718785 354
+ .quad 0x03FD30217B1006601
+ .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355
+ .quad 0x03FD30CDCD5ABA762
+ .quad 0x03FD31A55D07A8590 # 0.298482373803 356
+ .quad 0x03FD31A55D07A8590
+ .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357
+ .quad 0x03FD3251F0AA5CC1A
+ .quad 0x03FD32FEA167A6D70 # 0.299799463226 358
+ .quad 0x03FD32FEA167A6D70
+ .quad 0x03FD33D6A7509D491 # 0.300623525901 359
+ .quad 0x03FD33D6A7509D491
+ .quad 0x03FD348399ADA9D94 # 0.301283265328 360
+ .quad 0x03FD348399ADA9D94
+ .quad 0x03FD3530A9454ADC9 # 0.301943440298 361
+ .quad 0x03FD3530A9454ADC9
+ .quad 0x03FD360925EC44F5C # 0.302769272371 362
+ .quad 0x03FD360925EC44F5C
+ .quad 0x03FD36B6776BE1116 # 0.303430429420 363
+ .quad 0x03FD36B6776BE1116
+ .quad 0x03FD378F469437FB4 # 0.304257490918 364
+ .quad 0x03FD378F469437FB4
+ .quad 0x03FD383CDA2E14ECB # 0.304919632971 365
+ .quad 0x03FD383CDA2E14ECB
+ .quad 0x03FD38EA8B3924521 # 0.305582213748 366
+ .quad 0x03FD38EA8B3924521
+ .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367
+ .quad 0x03FD39C3D1FD60E74
+ .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368
+ .quad 0x03FD3A71C56BB48C7
+ .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369
+ .quad 0x03FD3B1FD66BC8D10
+ .quad 0x03FD3BF995502CB5C # 0.308569272059 370
+ .quad 0x03FD3BF995502CB5C
+ .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371
+ .quad 0x03FD3CA7E8FD01DF6
+ .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372
+ .quad 0x03FD3D565A5C5BF11
+ .quad 0x03FD3E3091E6049FB # 0.310732154526 373
+ .quad 0x03FD3E3091E6049FB
+ .quad 0x03FD3EDF463C1683E # 0.311398599069 374
+ .quad 0x03FD3EDF463C1683E
+ .quad 0x03FD3F8E1865A82DD # 0.312065488057 375
+ .quad 0x03FD3F8E1865A82DD
+ .quad 0x03FD403D086CEA79B # 0.312732822082 376
+ .quad 0x03FD403D086CEA79B
+ .quad 0x03FD4117DE854CA15 # 0.313567616354 377
+ .quad 0x03FD4117DE854CA15
+ .quad 0x03FD41C711E4BA15E # 0.314235953889 378
+ .quad 0x03FD41C711E4BA15E
+ .quad 0x03FD427663431B221 # 0.314904738398 379
+ .quad 0x03FD427663431B221
+ .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380
+ .quad 0x03FD4325D2AAB6F18
+ .quad 0x03FD44014838E5513 # 0.316411140893 381
+ .quad 0x03FD44014838E5513
+ .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382
+ .quad 0x03FD44B0FB5AF4F44
+ .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383
+ .quad 0x03FD4560CCA7CB3B2
+ .quad 0x03FD4610BC29C5E18 # 0.318423214006 384
+ .quad 0x03FD4610BC29C5E18
+ .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385
+ .quad 0x03FD46ECD216CDCB5
+ .quad 0x03FD479D05B65CB60 # 0.319934930091 386
+ .quad 0x03FD479D05B65CB60
+ .quad 0x03FD484D57ACE5A1A # 0.320607538154 387
+ .quad 0x03FD484D57ACE5A1A
+ .quad 0x03FD48FDC804DD1CB # 0.321280598924 388
+ .quad 0x03FD48FDC804DD1CB
+ .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389
+ .quad 0x03FD49DA7F3BCC420
+ .quad 0x03FD4A8B341552B09 # 0.322796644021 390
+ .quad 0x03FD4A8B341552B09
+ .quad 0x03FD4B3C077267E9A # 0.323471180303 391
+ .quad 0x03FD4B3C077267E9A
+ .quad 0x03FD4BECF95D97914 # 0.324146171892 392
+ .quad 0x03FD4BECF95D97914
+ .quad 0x03FD4C9E09E172C3D # 0.324821619401 393
+ .quad 0x03FD4C9E09E172C3D
+ .quad 0x03FD4D4F3908901A0 # 0.325497523449 394
+ .quad 0x03FD4D4F3908901A0
+ .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395
+ .quad 0x03FD4E2CDF1F341C1
+ .quad 0x03FD4EDE535C79642 # 0.327019979972 396
+ .quad 0x03FD4EDE535C79642
+ .quad 0x03FD4F8FE65F90500 # 0.327697372039 397
+ .quad 0x03FD4F8FE65F90500
+ .quad 0x03FD5041983326F2D # 0.328375223276 398
+ .quad 0x03FD5041983326F2D
+ .quad 0x03FD50F368E1F0F02 # 0.329053534308 399
+ .quad 0x03FD50F368E1F0F02
+ .quad 0x03FD51A55876A77F5 # 0.329732305758 400
+ .quad 0x03FD51A55876A77F5
+ .quad 0x03FD5283EF743F98B # 0.330581418486 401
+ .quad 0x03FD5283EF743F98B
+ .quad 0x03FD533624B59CA35 # 0.331261228165 402
+ .quad 0x03FD533624B59CA35
+ .quad 0x03FD53E878FFE6EAE # 0.331941500300 403
+ .quad 0x03FD53E878FFE6EAE
+ .quad 0x03FD549AEC5DEF880 # 0.332622235521 404
+ .quad 0x03FD549AEC5DEF880
+ .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405
+ .quad 0x03FD554D7EDA8D3C4
+ .quad 0x03FD560030809C759 # 0.333985097742 406
+ .quad 0x03FD560030809C759
+ .quad 0x03FD56B3015AFF52C # 0.334667226008 407
+ .quad 0x03FD56B3015AFF52C
+ .quad 0x03FD5765F1749DA6C # 0.335349819892 408
+ .quad 0x03FD5765F1749DA6C
+ .quad 0x03FD581900D864FD7 # 0.336032880027 409
+ .quad 0x03FD581900D864FD7
+ .quad 0x03FD58CC2F91489F5 # 0.336716407053 410
+ .quad 0x03FD58CC2F91489F5
+ .quad 0x03FD59AC5618CCE38 # 0.337571473373 411
+ .quad 0x03FD59AC5618CCE38
+ .quad 0x03FD5A5FCB795780C # 0.338256053239 412
+ .quad 0x03FD5A5FCB795780C
+ .quad 0x03FD5B136052BCE39 # 0.338941102075 413
+ .quad 0x03FD5B136052BCE39
+ .quad 0x03FD5BC714B008E23 # 0.339626620526 414
+ .quad 0x03FD5BC714B008E23
+ .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415
+ .quad 0x03FD5C7AE89C4D254
+ .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416
+ .quad 0x03FD5D2EDC22A12BA
+ .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417
+ .quad 0x03FD5DE2EF4E224D6
+ .quad 0x03FD5E972229F3C15 # 0.342373403369 418
+ .quad 0x03FD5E972229F3C15
+ .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419
+ .quad 0x03FD5F4B74C13EA04
+ .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420
+ .quad 0x03FD5FFFE71F31E9A
+ .quad 0x03FD60B4794F02875 # 0.344438453147 421
+ .quad 0x03FD60B4794F02875
+ .quad 0x03FD61692B5BEB520 # 0.345127751813 422
+ .quad 0x03FD61692B5BEB520
+ .quad 0x03FD621DFD512D14F # 0.345817525940 423
+ .quad 0x03FD621DFD512D14F
+ .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424
+ .quad 0x03FD62D2EF3A0E933
+ .quad 0x03FD63880121DC8AB # 0.347198503200 425
+ .quad 0x03FD63880121DC8AB
+ .quad 0x03FD643D3313E9B92 # 0.347889707652 426
+ .quad 0x03FD643D3313E9B92
+ .quad 0x03FD64F2851B8EE01 # 0.348581390197 427
+ .quad 0x03FD64F2851B8EE01
+ .quad 0x03FD65A7F7442AC90 # 0.349273551498 428
+ .quad 0x03FD65A7F7442AC90
+ .quad 0x03FD665D8999224A5 # 0.349966192218 429
+ .quad 0x03FD665D8999224A5
+ .quad 0x03FD67133C25E04A5 # 0.350659313022 430
+ .quad 0x03FD67133C25E04A5
+ .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431
+ .quad 0x03FD67C90EF5D5C4C
+ .quad 0x03FD687F021479CEE # 0.352046997547 432
+ .quad 0x03FD687F021479CEE
+ .quad 0x03FD6935158D499B3 # 0.352741562603 433
+ .quad 0x03FD6935158D499B3
+ .quad 0x03FD69EB496BC87E5 # 0.353436610416 434
+ .quad 0x03FD69EB496BC87E5
+ .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435
+ .quad 0x03FD6AA19DBB7FF34
+ .quad 0x03FD6B581287FF9FD # 0.354828156996 436
+ .quad 0x03FD6B581287FF9FD
+ .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437
+ .quad 0x03FD6C0EA7DCDD591
+ .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438
+ .quad 0x03FD6C97AD3CFCFD9
+ .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439
+ .quad 0x03FD6D4E7B9C727EC
+ .quad 0x03FD6E056AA4421D6 # 0.357442537571 440
+ .quad 0x03FD6E056AA4421D6
+ .quad 0x03FD6EBC7A6019066 # 0.358140861621 441
+ .quad 0x03FD6EBC7A6019066
+ .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442
+ .quad 0x03FD6F73AADBAAAB7
+ .quad 0x03FD702AFC22B0C6D # 0.359538974397 443
+ .quad 0x03FD702AFC22B0C6D
+ .quad 0x03FD70E26E40EB5FA # 0.360238764489 444
+ .quad 0x03FD70E26E40EB5FA
+ .quad 0x03FD719A014220CF5 # 0.360939044629 445
+ .quad 0x03FD719A014220CF5
+ .quad 0x03FD7251B5321DC54 # 0.361639815506 446
+ .quad 0x03FD7251B5321DC54
+ .quad 0x03FD73098A1CB54BA # 0.362341077807 447
+ .quad 0x03FD73098A1CB54BA
+ .quad 0x03FD73937F783CEBA # 0.362867347444 448
+ .quad 0x03FD73937F783CEBA
+ .quad 0x03FD744B8E35E9EDA # 0.363569471398 449
+ .quad 0x03FD744B8E35E9EDA
+ .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450
+ .quad 0x03FD7503BE0ED6C66
+ .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451
+ .quad 0x03FD75BC0F0EEE7DE
+ .quad 0x03FD76748142228C7 # 0.365678805982 452
+ .quad 0x03FD76748142228C7
+ .quad 0x03FD772D14B46AE00 # 0.366382907402 453
+ .quad 0x03FD772D14B46AE00
+ .quad 0x03FD77E5C971C5E06 # 0.367087504930 454
+ .quad 0x03FD77E5C971C5E06
+ .quad 0x03FD787066E04915F # 0.367616279067 455
+ .quad 0x03FD787066E04915F
+ .quad 0x03FD792955FDF47A3 # 0.368321746469 456
+ .quad 0x03FD792955FDF47A3
+ .quad 0x03FD79E26687CFB3D # 0.369027711906 457
+ .quad 0x03FD79E26687CFB3D
+ .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458
+ .quad 0x03FD7A9B9889F19E2
+ .quad 0x03FD7B54EC1077A48 # 0.370441139703 459
+ .quad 0x03FD7B54EC1077A48
+ .quad 0x03FD7C0E612785C74 # 0.371148603475 460
+ .quad 0x03FD7C0E612785C74
+ .quad 0x03FD7C998F06FB152 # 0.371679529954 461
+ .quad 0x03FD7C998F06FB152
+ .quad 0x03FD7D533EF841E8A # 0.372387870696 462
+ .quad 0x03FD7D533EF841E8A
+ .quad 0x03FD7E0D109B95F19 # 0.373096713539 463
+ .quad 0x03FD7E0D109B95F19
+ .quad 0x03FD7EC703FD340AA # 0.373806059198 464
+ .quad 0x03FD7EC703FD340AA
+ .quad 0x03FD7F8119295FB9B # 0.374515908385 465
+ .quad 0x03FD7F8119295FB9B
+ .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466
+ .quad 0x03FD800CBF3ED1CC2
+ .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467
+ .quad 0x03FD80C70FAB0BDF6
+ .quad 0x03FD81818203AFC7F # 0.376470595813 468
+ .quad 0x03FD81818203AFC7F
+ .quad 0x03FD823C16551A3C3 # 0.377182339615 469
+ .quad 0x03FD823C16551A3C3
+ .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470
+ .quad 0x03FD82C81BE4DFF4A
+ .quad 0x03FD8382EBC7794D1 # 0.378429111528 471
+ .quad 0x03FD8382EBC7794D1
+ .quad 0x03FD843DDDC4FB137 # 0.379142251156 472
+ .quad 0x03FD843DDDC4FB137
+ .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473
+ .quad 0x03FD84F8F1E9DB72B
+ .quad 0x03FD85855776DCBFB # 0.380391470556 474
+ .quad 0x03FD85855776DCBFB
+ .quad 0x03FD8640A77EB3957 # 0.381106011494 475
+ .quad 0x03FD8640A77EB3957
+ .quad 0x03FD86FC19D05148E # 0.381821063366 476
+ .quad 0x03FD86FC19D05148E
+ .quad 0x03FD87B7AE7845C0F # 0.382536626902 477
+ .quad 0x03FD87B7AE7845C0F
+ .quad 0x03FD8844748678822 # 0.383073635776 478
+ .quad 0x03FD8844748678822
+ .quad 0x03FD89004563D3DFD # 0.383790096491 479
+ .quad 0x03FD89004563D3DFD
+ .quad 0x03FD89BC38BA356B4 # 0.384507070890 480
+ .quad 0x03FD89BC38BA356B4
+ .quad 0x03FD8A4945E20894E # 0.385045139237 481
+ .quad 0x03FD8A4945E20894E
+ .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482
+ .quad 0x03FD8B0575AAB1FC5
+ .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483
+ .quad 0x03FD8BC1C80F45A32
+ .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484
+ .quad 0x03FD8C7E3D1C80B2F
+ .quad 0x03FD8D0BABACC89EE # 0.387739832326 485
+ .quad 0x03FD8D0BABACC89EE
+ .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486
+ .quad 0x03FD8DC85D7FE5013
+ .quad 0x03FD8E85321ED5598 # 0.389179976589 487
+ .quad 0x03FD8E85321ED5598
+ .quad 0x03FD8F12E873862C7 # 0.389720565845 488
+ .quad 0x03FD8F12E873862C7
+ .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489
+ .quad 0x03FD8FCFFA1614AA0
+ .quad 0x03FD908D2EA7D9511 # 0.391163567538 490
+ .quad 0x03FD908D2EA7D9511
+ .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491
+ .quad 0x03FD911B2D09ED9D6
+ .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492
+ .quad 0x03FD91D89EDD6B7FF
+ .quad 0x03FD929633C3B7D3E # 0.393151100941 493
+ .quad 0x03FD929633C3B7D3E
+ .quad 0x03FD93247A7C99B52 # 0.393693841796 494
+ .quad 0x03FD93247A7C99B52
+ .quad 0x03FD93E24CE3195E8 # 0.394417954789 495
+ .quad 0x03FD93E24CE3195E8
+ .quad 0x03FD9470C1CB1962E # 0.394961383840 496
+ .quad 0x03FD9470C1CB1962E
+ .quad 0x03FD952ED1D9C0435 # 0.395686415592 497
+ .quad 0x03FD952ED1D9C0435
+ .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498
+ .quad 0x03FD95ED0535EA5D9
+ .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499
+ .quad 0x03FD967BC2EDCCE17
+ .quad 0x03FD973A3431356AE # 0.397682967666 500
+ .quad 0x03FD973A3431356AE
+ .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501
+ .quad 0x03FD97F8C8E64A1C7
+ .quad 0x03FD9887CFB8A3932 # 0.398955579419 502
+ .quad 0x03FD9887CFB8A3932
+ .quad 0x03FD9946A2946EF3C # 0.399683513937 503
+ .quad 0x03FD9946A2946EF3C
+ .quad 0x03FD99D5D8130607C # 0.400229812776 504
+ .quad 0x03FD99D5D8130607C
+ .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505
+ .quad 0x03FD9A94E93E1EC37
+ .quad 0x03FD9B244D87735E8 # 0.401505671875 506
+ .quad 0x03FD9B244D87735E8
+ .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507
+ .quad 0x03FD9BE39D2A97F0B
+ .quad 0x03FD9CA3109266E23 # 0.402965792595 508
+ .quad 0x03FD9CA3109266E23
+ .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509
+ .quad 0x03FD9D32BEA15ED3A
+ .quad 0x03FD9DF270C1914A8 # 0.404245149435 510
+ .quad 0x03FD9DF270C1914A8
+ .quad 0x03FD9E824DEA3E135 # 0.404793946669 511
+ .quad 0x03FD9E824DEA3E135
+ .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512
+ .quad 0x03FD9F423EEBF9DA1
+ .quad 0x03FD9FD24B4D47012 # 0.406075646011 513
+ .quad 0x03FD9FD24B4D47012
+ .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514
+ .quad 0x03FDA0927B59DA6E2
+ .quad 0x03FDA152CF7F3B46D # 0.407542459622 515
+ .quad 0x03FDA152CF7F3B46D
+ .quad 0x03FDA1E32653B420E # 0.408093069896 516
+ .quad 0x03FDA1E32653B420E
+ .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517
+ .quad 0x03FDA2A3B9C527DB1
+ .quad 0x03FDA33440224FA79 # 0.409379007429 518
+ .quad 0x03FDA33440224FA79
+ .quad 0x03FDA3F513098DD09 # 0.410114572008 519
+ .quad 0x03FDA3F513098DD09
+ .quad 0x03FDA485C90EBDB0C # 0.410666600728 520
+ .quad 0x03FDA485C90EBDB0C
+ .quad 0x03FDA546DB95A721A # 0.411403113374 521
+ .quad 0x03FDA546DB95A721A
+ .quad 0x03FDA5D7C16257437 # 0.411955854060 522
+ .quad 0x03FDA5D7C16257437
+ .quad 0x03FDA69913B2F6572 # 0.412693317221 523
+ .quad 0x03FDA69913B2F6572
+ .quad 0x03FDA72A2966BE1EA # 0.413246771713 524
+ .quad 0x03FDA72A2966BE1EA
+ .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525
+ .quad 0x03FDA7EBBBAB46E8B
+ .quad 0x03FDA87D0165DD199 # 0.414539357989 526
+ .quad 0x03FDA87D0165DD199
+ .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527
+ .quad 0x03FDA93ED3C8AD9E3
+ .quad 0x03FDA9D049A9E884A # 0.415833617206 528
+ .quad 0x03FDA9D049A9E884A
+ .quad 0x03FDAA925C5588EFA # 0.416573946686 529
+ .quad 0x03FDAA925C5588EFA
+ .quad 0x03FDAB24027D5E8AF # 0.417129553701 530
+ .quad 0x03FDAB24027D5E8AF
+ .quad 0x03FDABE6559C8167C # 0.417870843580 531
+ .quad 0x03FDABE6559C8167C
+ .quad 0x03FDAC782C2B07944 # 0.418427171828 532
+ .quad 0x03FDAC782C2B07944
+ .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533
+ .quad 0x03FDAD3ABFE88A06E
+ .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534
+ .quad 0x03FDADCCC6FDF6A80
+ .quad 0x03FDAE5EE2E961227 # 0.420283837790 535
+ .quad 0x03FDAE5EE2E961227
+ .quad 0x03FDAF21D34189D0A # 0.421027470470 536
+ .quad 0x03FDAF21D34189D0A
+ .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537
+ .quad 0x03FDAFB41FE2167B4
+ .quad 0x03FDB07751416A7F3 # 0.422330159776 538
+ .quad 0x03FDB07751416A7F3
+ .quad 0x03FDB109CEB79DB8A # 0.422888975102 539
+ .quad 0x03FDB109CEB79DB8A
+ .quad 0x03FDB1CD41498DF12 # 0.423634548296 540
+ .quad 0x03FDB1CD41498DF12
+ .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541
+ .quad 0x03FDB25FEFB60CB2E
+ .quad 0x03FDB323A3A63594A # 0.424940640468 542
+ .quad 0x03FDB323A3A63594A
+ .quad 0x03FDB3B68329C59E9 # 0.425500916886 543
+ .quad 0x03FDB3B68329C59E9
+ .quad 0x03FDB44977C148F1A # 0.426061507389 544
+ .quad 0x03FDB44977C148F1A
+ .quad 0x03FDB50D895F7773A # 0.426809450580 545
+ .quad 0x03FDB50D895F7773A
+ .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546
+ .quad 0x03FDB5A0AF3D169CD
+ .quad 0x03FDB66502A41E541 # 0.428119698779 547
+ .quad 0x03FDB66502A41E541
+ .quad 0x03FDB6F859E8EF639 # 0.428681759684 548
+ .quad 0x03FDB6F859E8EF639
+ .quad 0x03FDB78BC664238C0 # 0.429244136679 549
+ .quad 0x03FDB78BC664238C0
+ .quad 0x03FDB85078123E586 # 0.429994464983 550
+ .quad 0x03FDB85078123E586
+ .quad 0x03FDB8E41624226C5 # 0.430557580905 551
+ .quad 0x03FDB8E41624226C5
+ .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552
+ .quad 0x03FDB9A90A06BCB3D
+ .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553
+ .quad 0x03FDBA3CD9D0B81BD
+ .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554
+ .quad 0x03FDBAD0BEF3DB164
+ .quad 0x03FDBB9611B80E2FC # 0.433189656123 555
+ .quad 0x03FDBB9611B80E2FC
+ .quad 0x03FDBC2A28C33B75D # 0.433754574696 556
+ .quad 0x03FDBC2A28C33B75D
+ .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557
+ .quad 0x03FDBCBE553C2BDDF
+ .quad 0x03FDBD84073D8EC2B # 0.435073960430 558
+ .quad 0x03FDBD84073D8EC2B
+ .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559
+ .quad 0x03FDBE1865CEC1EC9
+ .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560
+ .quad 0x03FDBEACD9E271AD1
+ .quad 0x03FDBF72EB7D20355 # 0.436961822044 561
+ .quad 0x03FDBF72EB7D20355
+ .quad 0x03FDC00791D99132B # 0.437528876213 562
+ .quad 0x03FDC00791D99132B
+ .quad 0x03FDC09C4DCD565AB # 0.438096252115 563
+ .quad 0x03FDC09C4DCD565AB
+ .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564
+ .quad 0x03FDC162BF5DF23E4
+ .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565
+ .quad 0x03FDC1F7ADCB3DAB0
+ .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566
+ .quad 0x03FDC28CB1E4D32FD
+ .quad 0x03FDC35383C8850B0 # 0.440748271097 567
+ .quad 0x03FDC35383C8850B0
+ .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568
+ .quad 0x03FDC3E8BA8CACF27
+ .quad 0x03FDC47E071233744 # 0.441887007223 569
+ .quad 0x03FDC47E071233744
+ .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570
+ .quad 0x03FDC54539A6ABCD2
+ .quad 0x03FDC5DAB908186FF # 0.443217173690 571
+ .quad 0x03FDC5DAB908186FF
+ .quad 0x03FDC6704E4016FF7 # 0.443787787115 572
+ .quad 0x03FDC6704E4016FF7
+ .quad 0x03FDC737E1E38F4FB # 0.444549111857 573
+ .quad 0x03FDC737E1E38F4FB
+ .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574
+ .quad 0x03FDC7CDAA290FEAD
+ .quad 0x03FDC863885A74D16 # 0.445692186852 575
+ .quad 0x03FDC863885A74D16
+ .quad 0x03FDC8F97C7E299DB # 0.446264214707 576
+ .quad 0x03FDC8F97C7E299DB
+ .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577
+ .quad 0x03FDC9C18EDC7C26B
+ .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578
+ .quad 0x03FDCA57B64E9DB05
+ .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579
+ .quad 0x03FDCAEDF3C88A364
+ .quad 0x03FDCB844750B9995 # 0.448746790220 580
+ .quad 0x03FDCB844750B9995
+ .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581
+ .quad 0x03FDCC4CD90B3ECE5
+ .quad 0x03FDCCE3602341C10 # 0.450086118843 582
+ .quad 0x03FDCCE3602341C10
+ .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583
+ .quad 0x03FDCD79FD5F2BC77
+ .quad 0x03FDCE10B0C581284 # 0.451235544257 584
+ .quad 0x03FDCE10B0C581284
+ .quad 0x03FDCED9C27EC6607 # 0.452002562511 585
+ .quad 0x03FDCED9C27EC6607
+ .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586
+ .quad 0x03FDCF70A9B6D3810
+ .quad 0x03FDD007A72F19BBC # 0.453154194116 587
+ .quad 0x03FDD007A72F19BBC
+ .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588
+ .quad 0x03FDD09EBAEE29DD8
+ .quad 0x03FDD1684D49F46AE # 0.454499442710 589
+ .quad 0x03FDD1684D49F46AE
+ .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590
+ .quad 0x03FDD1FF951D1F1B3
+ .quad 0x03FDD296F34D0B65C # 0.455653955057 591
+ .quad 0x03FDD296F34D0B65C
+ .quad 0x03FDD32E67E056BD5 # 0.456231711452 592
+ .quad 0x03FDD32E67E056BD5
+ .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593
+ .quad 0x03FDD3C5F2DDA1840
+ .quad 0x03FDD490246DEFA6A # 0.457581109247 594
+ .quad 0x03FDD490246DEFA6A
+ .quad 0x03FDD527E3D1B95FC # 0.458159980465 595
+ .quad 0x03FDD527E3D1B95FC
+ .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596
+ .quad 0x03FDD5BFB9B5AE71F
+ .quad 0x03FDD657A6207C0DB # 0.459318729146 597
+ .quad 0x03FDD657A6207C0DB
+ .quad 0x03FDD6EFA918D25CE # 0.459898607388 598
+ .quad 0x03FDD6EFA918D25CE
+ .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599
+ .quad 0x03FDD7BA7AD9E7DA1
+ .quad 0x03FDD852B28BE5A0F # 0.461252965726 600
+ .quad 0x03FDD852B28BE5A0F
+ .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601
+ .quad 0x03FDD8EB00E1CCE14
+ .quad 0x03FDD98365E25ABB9 # 0.462415306035 602
+ .quad 0x03FDD98365E25ABB9
+ .quad 0x03FDDA1BE1944F538 # 0.462996983220 603
+ .quad 0x03FDDA1BE1944F538
+ .quad 0x03FDDAE75484C9615 # 0.463773079495 604
+ .quad 0x03FDDAE75484C9615
+ .quad 0x03FDDB8005445488B # 0.464355547233 605
+ .quad 0x03FDDB8005445488B
+ .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606
+ .quad 0x03FDDC18CCCBDCB83
+ .quad 0x03FDDCB1AB222F33D # 0.465521501504 607
+ .quad 0x03FDDCB1AB222F33D
+ .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608
+ .quad 0x03FDDD4AA04E1C4B7
+ .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609
+ .quad 0x03FDDDE3AC56775D2
+ .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610
+ .quad 0x03FDDE7CCF4216D6E
+ .quad 0x03FDDF492177D7BBC # 0.468052409114 611
+ .quad 0x03FDDF492177D7BBC
+ .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612
+ .quad 0x03FDDFE279E5BF4EE
+ .quad 0x03FDE07BE94DCC439 # 0.469222684263 613
+ .quad 0x03FDE07BE94DCC439
+ .quad 0x03FDE1156FB6E2626 # 0.469808335817 614
+ .quad 0x03FDE1156FB6E2626
+ .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615
+ .quad 0x03FDE1AF0D27E88D7
+ .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616
+ .quad 0x03FDE248C1A7C8C26
+ .quad 0x03FDE2E28D3D701CC # 0.471567351222 617
+ .quad 0x03FDE2E28D3D701CC
+ .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618
+ .quad 0x03FDE37C6FEFCED73
+ .quad 0x03FDE449C232C39D8 # 0.472937616681 619
+ .quad 0x03FDE449C232C39D8
+ .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620
+ .quad 0x03FDE4E3DAEDDB5F6
+ .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621
+ .quad 0x03FDE57E0ADCE1EA5
+ .quad 0x03FDE6185206D516F # 0.474702150027 622
+ .quad 0x03FDE6185206D516F
+ .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623
+ .quad 0x03FDE6B2B072B5E6F
+ .quad 0x03FDE74D26278887A # 0.475880237735 624
+ .quad 0x03FDE74D26278887A
+ .quad 0x03FDE7E7B32C5453F # 0.476469802457 625
+ .quad 0x03FDE7E7B32C5453F
+ .quad 0x03FDE882578823D52 # 0.477059714970 626
+ .quad 0x03FDE882578823D52
+ .quad 0x03FDE91D134204C67 # 0.477649975686 627
+ .quad 0x03FDE91D134204C67
+ .quad 0x03FDE9B7E6610815A # 0.478240585015 628
+ .quad 0x03FDE9B7E6610815A
+ .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629
+ .quad 0x03FDEA52D0EC41E5E
+ .quad 0x03FDEB218376ECFC0 # 0.479620031484 630
+ .quad 0x03FDEB218376ECFC0
+ .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631
+ .quad 0x03FDEBBCA4C4E9E87
+ .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632
+ .quad 0x03FDEC57DD96CD0CB
+ .quad 0x03FDECF32DF3B887D # 0.481396406174 633
+ .quad 0x03FDECF32DF3B887D
+ .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634
+ .quad 0x03FDED8E95E2D1B88
+ .quad 0x03FDEE2A156B413E5 # 0.482582411453 635
+ .quad 0x03FDEE2A156B413E5
+ .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636
+ .quad 0x03FDEEC5AC9432FCB
+ .quad 0x03FDEF615B64D61C7 # 0.483769825010 637
+ .quad 0x03FDEF615B64D61C7
+ .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638
+ .quad 0x03FDEFFD21E45D0D1
+ .quad 0x03FDF0990019FD887 # 0.484958650194 639
+ .quad 0x03FDF0990019FD887
+ .quad 0x03FDF134F60CF092D # 0.485553593197 640
+ .quad 0x03FDF134F60CF092D
+ .quad 0x03FDF1D103C4727E4 # 0.486148890367 641
+ .quad 0x03FDF1D103C4727E4
+ .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642
+ .quad 0x03FDF26D2947C2EC5
+ .quad 0x03FDF309669E24CF9 # 0.487340548899 643
+ .quad 0x03FDF309669E24CF9
+ .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644
+ .quad 0x03FDF3A5BBCEDE6E1
+ .quad 0x03FDF44228E13963A # 0.488533629176 645
+ .quad 0x03FDF44228E13963A
+ .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646
+ .quad 0x03FDF4DEADDC82A35
+ .quad 0x03FDF57B4AC80A79A # 0.489728134594 647
+ .quad 0x03FDF57B4AC80A79A
+ .quad 0x03FDF617FFAB248ED # 0.490325922795 648
+ .quad 0x03FDF617FFAB248ED
+ .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649
+ .quad 0x03FDF6B4CC8D27E87
+ .quad 0x03FDF751B1756EEC8 # 0.491522572320 650
+ .quad 0x03FDF751B1756EEC8
+ .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651
+ .quad 0x03FDF7EEAE6B5761C
+ .quad 0x03FDF88BC3764273B # 0.492720655530 652
+ .quad 0x03FDF88BC3764273B
+ .quad 0x03FDF928F09D94B32 # 0.493320235842 653
+ .quad 0x03FDF928F09D94B32
+ .quad 0x03FDF9C635E8B6192 # 0.493920175866 654
+ .quad 0x03FDF9C635E8B6192
+ .quad 0x03FDFA63935F1208C # 0.494520476034 655
+ .quad 0x03FDFA63935F1208C
+ .quad 0x03FDFB0109081751A # 0.495121136779 656
+ .quad 0x03FDFB0109081751A
+ .quad 0x03FDFB9E96EB38311 # 0.495722158534 657
+ .quad 0x03FDFB9E96EB38311
+ .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658
+ .quad 0x03FDFC3C3D0FEA555
+ .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659
+ .quad 0x03FDFCD9FB7DA6DEF
+ .quad 0x03FDFD77D23BEA634 # 0.497527394206 660
+ .quad 0x03FDFD77D23BEA634
+ .quad 0x03FDFE15C15234EE2 # 0.498129864352 661
+ .quad 0x03FDFE15C15234EE2
+ .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662
+ .quad 0x03FDFEB3C8C80A04E
+ .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663
+ .quad 0x03FDFF51E8A4F0A74
+ .quad 0x03FDFFF020F07352E # 0.499939455677 664
+ .quad 0x03FDFFF020F07352E
+ .quad 0x03FE004738D910023 # 0.500543381211 665
+ .quad 0x03FE004738D910023
+ .quad 0x03FE00966D78C41CF # 0.501147671692 666
+ .quad 0x03FE00966D78C41CF
+ .quad 0x03FE00E5AE5B207AB # 0.501752327560 667
+ .quad 0x03FE00E5AE5B207AB
+ .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668
+ .quad 0x03FE011A8B18F0ED6
+ .quad 0x03FE0169E072D7311 # 0.502760900515 669
+ .quad 0x03FE0169E072D7311
+ .quad 0x03FE01B942198A5A1 # 0.503366532915 670
+ .quad 0x03FE01B942198A5A1
+ .quad 0x03FE0208B010DB642 # 0.503972532327 671
+ .quad 0x03FE0208B010DB642
+ .quad 0x03FE02582A5C9D122 # 0.504578899198 672
+ .quad 0x03FE02582A5C9D122
+ .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673
+ .quad 0x03FE02A7B100A3EF0
+ .quad 0x03FE02F74400C64EA # 0.505792737097 674
+ .quad 0x03FE02F74400C64EA
+ .quad 0x03FE0346E360DC4F9 # 0.506400209020 675
+ .quad 0x03FE0346E360DC4F9
+ .quad 0x03FE03968F24BFDB6 # 0.507008050190 676
+ .quad 0x03FE03968F24BFDB6
+ .quad 0x03FE03E647504CA89 # 0.507616261055 677
+ .quad 0x03FE03E647504CA89
+ .quad 0x03FE04360BE7603AE # 0.508224842066 678
+ .quad 0x03FE04360BE7603AE
+ .quad 0x03FE046B4089BE0FD # 0.508630768599 679
+ .quad 0x03FE046B4089BE0FD
+ .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680
+ .quad 0x03FE04BB19DCA36B3
+ .quad 0x03FE050AFFA5671A5 # 0.509849537793 681
+ .quad 0x03FE050AFFA5671A5
+ .quad 0x03FE055AF1E7ED47B # 0.510459479867 682
+ .quad 0x03FE055AF1E7ED47B
+ .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683
+ .quad 0x03FE05AAF0A81BF04
+ .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684
+ .quad 0x03FE05FAFBE9DAE58
+ .quad 0x03FE064B13B113CDD # 0.512291541448 685
+ .quad 0x03FE064B13B113CDD
+ .quad 0x03FE069B3801B2263 # 0.512902975280 686
+ .quad 0x03FE069B3801B2263
+ .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687
+ .quad 0x03FE06D0AC85B63A2
+ .quad 0x03FE0720E5C40DF1D # 0.513922863181 688
+ .quad 0x03FE0720E5C40DF1D
+ .quad 0x03FE07712B9648153 # 0.514535295577 689
+ .quad 0x03FE07712B9648153
+ .quad 0x03FE07C17E0056E7C # 0.515148103277 690
+ .quad 0x03FE07C17E0056E7C
+ .quad 0x03FE0811DD062E889 # 0.515761286740 691
+ .quad 0x03FE0811DD062E889
+ .quad 0x03FE086248ABC4F3B # 0.516374846428 692
+ .quad 0x03FE086248ABC4F3B
+ .quad 0x03FE08B2C0F512033 # 0.516988782802 693
+ .quad 0x03FE08B2C0F512033
+ .quad 0x03FE08E86D82DA3EE # 0.517398283218 694
+ .quad 0x03FE08E86D82DA3EE
+ .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695
+ .quad 0x03FE0938FAE5D8E9B
+ .quad 0x03FE098994F72C539 # 0.518627791569 696
+ .quad 0x03FE098994F72C539
+ .quad 0x03FE09DA3BBAD339C # 0.519243113094 697
+ .quad 0x03FE09DA3BBAD339C
+ .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698
+ .quad 0x03FE0A2AEF34CE3D1
+ .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699
+ .quad 0x03FE0A7BAF691FE34
+ .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700
+ .quad 0x03FE0AB18BF5823C3
+ .quad 0x03FE0B02616952989 # 0.521502536876 701
+ .quad 0x03FE0B02616952989
+ .quad 0x03FE0B5343A234476 # 0.522119630385 702
+ .quad 0x03FE0B5343A234476
+ .quad 0x03FE0BA432A430CA2 # 0.522737104934 703
+ .quad 0x03FE0BA432A430CA2
+ .quad 0x03FE0BF52E73538CE # 0.523354960993 704
+ .quad 0x03FE0BF52E73538CE
+ .quad 0x03FE0C463713A9E6F # 0.523973199034 705
+ .quad 0x03FE0C463713A9E6F
+ .quad 0x03FE0C7C43F4C861E # 0.524385570174 706
+ .quad 0x03FE0C7C43F4C861E
+ .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707
+ .quad 0x03FE0CCD61FAD07D2
+ .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708
+ .quad 0x03FE0D1E8CDCE3DB6
+ .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709
+ .quad 0x03FE0D6FC49F16E93
+ .quad 0x03FE0DC109458004A # 0.526863374456 710
+ .quad 0x03FE0DC109458004A
+ .quad 0x03FE0DF73E353F0ED # 0.527276939392 711
+ .quad 0x03FE0DF73E353F0ED
+ .quad 0x03FE0E4898611CCE1 # 0.527897607665 712
+ .quad 0x03FE0E4898611CCE1
+ .quad 0x03FE0E99FF7C20738 # 0.528518661406 713
+ .quad 0x03FE0E99FF7C20738
+ .quad 0x03FE0EEB738A67874 # 0.529140101094 714
+ .quad 0x03FE0EEB738A67874
+ .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715
+ .quad 0x03FE0F21C81D1ADC3
+ .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716
+ .quad 0x03FE0F7351C9FCD7F
+ .quad 0x03FE0FC4E875254C1 # 0.530799164104 717
+ .quad 0x03FE0FC4E875254C1
+ .quad 0x03FE10168C22B8FB9 # 0.531422023047 718
+ .quad 0x03FE10168C22B8FB9
+ .quad 0x03FE10683CD6DEA54 # 0.532045270185 719
+ .quad 0x03FE10683CD6DEA54
+ .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720
+ .quad 0x03FE109EB9E2E4C97
+ .quad 0x03FE10F08055E7785 # 0.533084879385 721
+ .quad 0x03FE10F08055E7785
+ .quad 0x03FE114253DA97DA0 # 0.533709164079 722
+ .quad 0x03FE114253DA97DA0
+ .quad 0x03FE1194347523FDC # 0.534333838748 723
+ .quad 0x03FE1194347523FDC
+ .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724
+ .quad 0x03FE11CAD1789B0F8
+ .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725
+ .quad 0x03FE121CC7EB8F7E6
+ .quad 0x03FE126ECB7F8F007 # 0.536001548120 726
+ .quad 0x03FE126ECB7F8F007
+ .quad 0x03FE12A57FDA37091 # 0.536418910396 727
+ .quad 0x03FE12A57FDA37091
+ .quad 0x03FE12F799594EFBC # 0.537045280601 728
+ .quad 0x03FE12F799594EFBC
+ .quad 0x03FE1349C004AFB00 # 0.537672043392 729
+ .quad 0x03FE1349C004AFB00
+ .quad 0x03FE139BF3E094003 # 0.538299199261 730
+ .quad 0x03FE139BF3E094003
+ .quad 0x03FE13D2C873C5E13 # 0.538717521794 731
+ .quad 0x03FE13D2C873C5E13
+ .quad 0x03FE142512549C16C # 0.539345333889 732
+ .quad 0x03FE142512549C16C
+ .quad 0x03FE14776971477F1 # 0.539973540381 733
+ .quad 0x03FE14776971477F1
+ .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734
+ .quad 0x03FE14C9CDCE0A74D
+ .quad 0x03FE1500C2BFD1561 # 0.541021428981 735
+ .quad 0x03FE1500C2BFD1561
+ .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736
+ .quad 0x03FE15533D3B8D7B3
+ .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737
+ .quad 0x03FE15A5C502C6DC5
+ .quad 0x03FE15DCD1973457B # 0.542700338085 738
+ .quad 0x03FE15DCD1973457B
+ .quad 0x03FE162F6F9071F76 # 0.543330656416 739
+ .quad 0x03FE162F6F9071F76
+ .quad 0x03FE16821AE0A13C6 # 0.543961372300 740
+ .quad 0x03FE16821AE0A13C6
+ .quad 0x03FE16B93F2C12808 # 0.544382070665 741
+ .quad 0x03FE16B93F2C12808
+ .quad 0x03FE170C00C169B51 # 0.545013450251 742
+ .quad 0x03FE170C00C169B51
+ .quad 0x03FE175ECFB935CC6 # 0.545645228728 743
+ .quad 0x03FE175ECFB935CC6
+ .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744
+ .quad 0x03FE17B1AC17CBD5B
+ .quad 0x03FE17E8F12052E8A # 0.546699080654 745
+ .quad 0x03FE17E8F12052E8A
+ .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746
+ .quad 0x03FE183BE3DE8A7AF
+ .quad 0x03FE188EE40F23CA7 # 0.547965170715 747
+ .quad 0x03FE188EE40F23CA7
+ .quad 0x03FE18C640FF75F06 # 0.548387557205 748
+ .quad 0x03FE18C640FF75F06
+ .quad 0x03FE191957A30FA51 # 0.549021471648 749
+ .quad 0x03FE191957A30FA51
+ .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750
+ .quad 0x03FE196C7BC4B1F3A
+ .quad 0x03FE19A3F0B1860BD # 0.550078889532 751
+ .quad 0x03FE19A3F0B1860BD
+ .quad 0x03FE19F72B59A0CEC # 0.550713877383 752
+ .quad 0x03FE19F72B59A0CEC
+ .quad 0x03FE1A4A738B7A33C # 0.551349268700 753
+ .quad 0x03FE1A4A738B7A33C
+ .quad 0x03FE1A820089A2156 # 0.551773087312 754
+ .quad 0x03FE1A820089A2156
+ .quad 0x03FE1AD55F55855C8 # 0.552409152212 755
+ .quad 0x03FE1AD55F55855C8
+ .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756
+ .quad 0x03FE1B28CBB6EC93E
+ .quad 0x03FE1B6070DB553D8 # 0.553470160269 757
+ .quad 0x03FE1B6070DB553D8
+ .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758
+ .quad 0x03FE1BB3F3EA714F6
+ .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759
+ .quad 0x03FE1BEBA8316EF2C
+ .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760
+ .quad 0x03FE1C3F41FA97C6B
+ .quad 0x03FE1C92E96C86020 # 0.555808348176 761
+ .quad 0x03FE1C92E96C86020
+ .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762
+ .quad 0x03FE1CCAB5FBFFEE1
+ .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763
+ .quad 0x03FE1D1E743BCFC47
+ .quad 0x03FE1D72403052E75 # 0.557512288951 764
+ .quad 0x03FE1D72403052E75
+ .quad 0x03FE1DAA251D7E433 # 0.557938728190 765
+ .quad 0x03FE1DAA251D7E433
+ .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766
+ .quad 0x03FE1DFE07F3D1DAB
+ .quad 0x03FE1E35FC265D75E # 0.559005622562 767
+ .quad 0x03FE1E35FC265D75E
+ .quad 0x03FE1E89F5EB04126 # 0.559646305979 768
+ .quad 0x03FE1E89F5EB04126
+ .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769
+ .quad 0x03FE1EDDFD77E1FEF
+ .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770
+ .quad 0x03FE1F160A2AD0DA3
+ .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771
+ .quad 0x03FE1F6A28BA1B476
+ .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772
+ .quad 0x03FE1FBE551DB43C1
+ .quad 0x03FE1FF67A6684F47 # 0.562427353873 773
+ .quad 0x03FE1FF67A6684F47
+ .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774
+ .quad 0x03FE204ABDE0BE5DF
+ .quad 0x03FE2082F29233211 # 0.563499050471 775
+ .quad 0x03FE2082F29233211
+ .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776
+ .quad 0x03FE20D74D2FBAFE4
+ .quad 0x03FE210F91524B469 # 0.564571896835 777
+ .quad 0x03FE210F91524B469
+ .quad 0x03FE2164031FDA0B0 # 0.565216157568 778
+ .quad 0x03FE2164031FDA0B0
+ .quad 0x03FE21B882DD26040 # 0.565860833641 779
+ .quad 0x03FE21B882DD26040
+ .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780
+ .quad 0x03FE21F0DFC65CEEC
+ .quad 0x03FE224576C81FFE0 # 0.566936218194 781
+ .quad 0x03FE224576C81FFE0
+ .quad 0x03FE227DE33896A44 # 0.567366696031 782
+ .quad 0x03FE227DE33896A44
+ .quad 0x03FE22D2918BA4A31 # 0.568012760445 783
+ .quad 0x03FE22D2918BA4A31
+ .quad 0x03FE23274DE272A83 # 0.568659242528 784
+ .quad 0x03FE23274DE272A83
+ .quad 0x03FE235FD33D232FC # 0.569090462888 785
+ .quad 0x03FE235FD33D232FC
+ .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786
+ .quad 0x03FE23B4A6F9D8688
+ .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787
+ .quad 0x03FE23ED3BF21CA33
+ .quad 0x03FE24422721A89D7 # 0.570817206248 788
+ .quad 0x03FE24422721A89D7
+ .quad 0x03FE247ACBC023D2B # 0.571249358372 789
+ .quad 0x03FE247ACBC023D2B
+ .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790
+ .quad 0x03FE24CFCE6F80D9B
+ .quad 0x03FE250882BCDD7D8 # 0.572330556445 791
+ .quad 0x03FE250882BCDD7D8
+ .quad 0x03FE255D9CF910A56 # 0.572979836849 792
+ .quad 0x03FE255D9CF910A56
+ .quad 0x03FE25B2C55CD5762 # 0.573629539091 793
+ .quad 0x03FE25B2C55CD5762
+ .quad 0x03FE25EB92D41992D # 0.574062908546 794
+ .quad 0x03FE25EB92D41992D
+ .quad 0x03FE2640D2D99FFEA # 0.574713315073 795
+ .quad 0x03FE2640D2D99FFEA
+ .quad 0x03FE2679B0166F51C # 0.575147154559 796
+ .quad 0x03FE2679B0166F51C
+ .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797
+ .quad 0x03FE26CF07CAD8B00
+ .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798
+ .quad 0x03FE2707F4D5F7C40
+ .quad 0x03FE275D644670606 # 0.576884397124 799
+ .quad 0x03FE275D644670606
+ .quad 0x03FE27966128AB11B # 0.577319179739 800
+ .quad 0x03FE27966128AB11B
+ .quad 0x03FE27EBE8626A387 # 0.577971708311 801
+ .quad 0x03FE27EBE8626A387
+ .quad 0x03FE2824F52493BD2 # 0.578406964030 802
+ .quad 0x03FE2824F52493BD2
+ .quad 0x03FE287A9434DBC7B # 0.579060203030 803
+ .quad 0x03FE287A9434DBC7B
+ .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804
+ .quad 0x03FE28B3B0DFCEB80
+ .quad 0x03FE290967D3ED18D # 0.580149883861 805
+ .quad 0x03FE290967D3ED18D
+ .quad 0x03FE294294708B773 # 0.580586088885 806
+ .quad 0x03FE294294708B773
+ .quad 0x03FE29986355D8C69 # 0.581240753393 807
+ .quad 0x03FE29986355D8C69
+ .quad 0x03FE29D19FED0C082 # 0.581677434622 808
+ .quad 0x03FE29D19FED0C082
+ .quad 0x03FE2A2786D0EC107 # 0.582332814220 809
+ .quad 0x03FE2A2786D0EC107
+ .quad 0x03FE2A60D36BA5253 # 0.582769972697 810
+ .quad 0x03FE2A60D36BA5253
+ .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811
+ .quad 0x03FE2AB6D25B86EF7
+ .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812
+ .quad 0x03FE2AF02F02BE4AB
+ .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813
+ .quad 0x03FE2B46460C1C2B3
+ .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814
+ .quad 0x03FE2B7FB2C8D1CC1
+ .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815
+ .quad 0x03FE2BD5E1F9316F2
+ .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816
+ .quad 0x03FE2C0F5ED46CE8D
+ .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817
+ .quad 0x03FE2C65A6395F5F5
+ .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818
+ .quad 0x03FE2C9F333C2FE1E
+ .quad 0x03FE2CF592E351AE5 # 0.587811079263 819
+ .quad 0x03FE2CF592E351AE5
+ .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820
+ .quad 0x03FE2D2F3016CE0EF
+ .quad 0x03FE2D85A80DC7324 # 0.588910342867 821
+ .quad 0x03FE2D85A80DC7324
+ .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822
+ .quad 0x03FE2DBF557B0DF43
+ .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823
+ .quad 0x03FE2E15E5CF91FA7
+ .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824
+ .quad 0x03FE2E4FA37FC9577
+ .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825
+ .quad 0x03FE2E8967B3BF4E1
+ .quad 0x03FE2EE01A3BED567 # 0.591553516212 826
+ .quad 0x03FE2EE01A3BED567
+ .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827
+ .quad 0x03FE2F19EEBFB00BA
+ .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828
+ .quad 0x03FE2F70B9C67A7C2
+ .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829
+ .quad 0x03FE2FAA9EA342D04
+ .quad 0x03FE3001823684D73 # 0.593761510043 830
+ .quad 0x03FE3001823684D73
+ .quad 0x03FE303B7775937EF # 0.594203694441 831
+ .quad 0x03FE303B7775937EF
+ .quad 0x03FE309273A3340FC # 0.594867337868 832
+ .quad 0x03FE309273A3340FC
+ .quad 0x03FE30CC794DD19D0 # 0.595310011625 833
+ .quad 0x03FE30CC794DD19D0
+ .quad 0x03FE3106858C76BB7 # 0.595752881428 834
+ .quad 0x03FE3106858C76BB7
+ .quad 0x03FE315DA4434068B # 0.596417554101 835
+ .quad 0x03FE315DA4434068B
+ .quad 0x03FE3197C0FA80E6A # 0.596860914783 836
+ .quad 0x03FE3197C0FA80E6A
+ .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837
+ .quad 0x03FE31EEF86D36EF1
+ .quad 0x03FE322925A66E62D # 0.597970177237 838
+ .quad 0x03FE322925A66E62D
+ .quad 0x03FE328075E32022F # 0.598636325813 839
+ .quad 0x03FE328075E32022F
+ .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840
+ .quad 0x03FE32BAB3A7B21E9
+ .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841
+ .quad 0x03FE32F4F80D0B1BD
+ .quad 0x03FE334C6B15D30DD # 0.600192400374 842
+ .quad 0x03FE334C6B15D30DD
+ .quad 0x03FE3386C013B90D6 # 0.600637438209 843
+ .quad 0x03FE3386C013B90D6
+ .quad 0x03FE33DE4C086C40A # 0.601305366543 844
+ .quad 0x03FE33DE4C086C40A
+ .quad 0x03FE3418B1A85622C # 0.601750900077 845
+ .quad 0x03FE3418B1A85622C
+ .quad 0x03FE34531DF21CFE3 # 0.602196632199 846
+ .quad 0x03FE34531DF21CFE3
+ .quad 0x03FE34AACCE299BA5 # 0.602865603124 847
+ .quad 0x03FE34AACCE299BA5
+ .quad 0x03FE34E549DBB21EF # 0.603311832493 848
+ .quad 0x03FE34E549DBB21EF
+ .quad 0x03FE353D11DA4F855 # 0.603981550121 849
+ .quad 0x03FE353D11DA4F855
+ .quad 0x03FE35779F8C43D6D # 0.604428277847 850
+ .quad 0x03FE35779F8C43D6D
+ .quad 0x03FE35B233F13DD4A # 0.604875205229 851
+ .quad 0x03FE35B233F13DD4A
+ .quad 0x03FE360A1F1BBA738 # 0.605545971045 852
+ .quad 0x03FE360A1F1BBA738
+ .quad 0x03FE3644C446F97BC # 0.605993398346 853
+ .quad 0x03FE3644C446F97BC
+ .quad 0x03FE367F702A9EA94 # 0.606441025927 854
+ .quad 0x03FE367F702A9EA94
+ .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855
+ .quad 0x03FE36D77E9D34FD7
+ .quad 0x03FE37123B54987B7 # 0.607560972287 856
+ .quad 0x03FE37123B54987B7
+ .quad 0x03FE376A630C0A1D6 # 0.608233542652 857
+ .quad 0x03FE376A630C0A1D6
+ .quad 0x03FE37A530A0D5A31 # 0.608682174333 858
+ .quad 0x03FE37A530A0D5A31
+ .quad 0x03FE37E004F74E13B # 0.609131007374 859
+ .quad 0x03FE37E004F74E13B
+ .quad 0x03FE383850278CFD9 # 0.609804634884 860
+ .quad 0x03FE383850278CFD9
+ .quad 0x03FE3873356902AB7 # 0.610253972119 861
+ .quad 0x03FE3873356902AB7
+ .quad 0x03FE38AE2171976E8 # 0.610703511349 862
+ .quad 0x03FE38AE2171976E8
+ .quad 0x03FE390690373AFFF # 0.611378199331 863
+ .quad 0x03FE390690373AFFF
+ .quad 0x03FE39418D3872A53 # 0.611828244343 864
+ .quad 0x03FE39418D3872A53
+ .quad 0x03FE397C91064221F # 0.612278491987 865
+ .quad 0x03FE397C91064221F
+ .quad 0x03FE39D5237E045A5 # 0.612954243787 866
+ .quad 0x03FE39D5237E045A5
+ .quad 0x03FE3A1038522CE82 # 0.613404998809 867
+ .quad 0x03FE3A1038522CE82
+ .quad 0x03FE3A68E45AD354B # 0.614081512534 868
+ .quad 0x03FE3A68E45AD354B
+ .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869
+ .quad 0x03FE3AA40A3F2A68B
+ .quad 0x03FE3ADF36F98A182 # 0.614984243356 870
+ .quad 0x03FE3ADF36F98A182
+ .quad 0x03FE3B3806E5DF340 # 0.615661826668 871
+ .quad 0x03FE3B3806E5DF340
+ .quad 0x03FE3B7344BE40311 # 0.616113804077 872
+ .quad 0x03FE3B7344BE40311
+ .quad 0x03FE3BAE897234A87 # 0.616565985862 873
+ .quad 0x03FE3BAE897234A87
+ .quad 0x03FE3C077D5F51881 # 0.617244642149 874
+ .quad 0x03FE3C077D5F51881
+ .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875
+ .quad 0x03FE3C42D33F2AE7B
+ .quad 0x03FE3C7E30002960C # 0.618150234241 876
+ .quad 0x03FE3C7E30002960C
+ .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877
+ .quad 0x03FE3CD7480B4A8A3
+ .quad 0x03FE3D12B60622748 # 0.619283378838 878
+ .quad 0x03FE3D12B60622748
+ .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879
+ .quad 0x03FE3D4E2AE7B7E2B
+ .quad 0x03FE3D89A6B1A558D # 0.620190819917 880
+ .quad 0x03FE3D89A6B1A558D
+ .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881
+ .quad 0x03FE3DE2ED57B1F9B
+ .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882
+ .quad 0x03FE3E1E7A6D8330E
+ .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883
+ .quad 0x03FE3E5A0E714DA6E
+ .quad 0x03FE3EB37978B85B6 # 0.622463031756 884
+ .quad 0x03FE3EB37978B85B6
+ .quad 0x03FE3EEF1ED68236B # 0.622918094335 885
+ .quad 0x03FE3EEF1ED68236B
+ .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886
+ .quad 0x03FE3F2ACB27ED6C7
+ .quad 0x03FE3F845AAE68C81 # 0.624056657591 887
+ .quad 0x03FE3F845AAE68C81
+ .quad 0x03FE3FC0186800514 # 0.624512446113 888
+ .quad 0x03FE3FC0186800514
+ .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889
+ .quad 0x03FE3FFBDD1AE8406
+ .quad 0x03FE4037A8C8C197A # 0.625424646860 890
+ .quad 0x03FE4037A8C8C197A
+ .quad 0x03FE409167679DD99 # 0.626109343909 891
+ .quad 0x03FE409167679DD99
+ .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892
+ .quad 0x03FE40CD448FF6DD6
+ .quad 0x03FE410928B8F950F # 0.627023003177 893
+ .quad 0x03FE410928B8F950F
+ .quad 0x03FE41630C1B50AFF # 0.627708795866 894
+ .quad 0x03FE41630C1B50AFF
+ .quad 0x03FE419F01CD27AD0 # 0.628166252416 895
+ .quad 0x03FE419F01CD27AD0
+ .quad 0x03FE41DAFE85672B9 # 0.628623918328 896
+ .quad 0x03FE41DAFE85672B9
+ .quad 0x03FE42170245B4C6A # 0.629081793794 897
+ .quad 0x03FE42170245B4C6A
+ .quad 0x03FE42711518DF546 # 0.629769000326 898
+ .quad 0x03FE42711518DF546
+ .quad 0x03FE42AD2A74888A0 # 0.630227400518 899
+ .quad 0x03FE42AD2A74888A0
+ .quad 0x03FE42E946DE080C0 # 0.630686010936 900
+ .quad 0x03FE42E946DE080C0
+ .quad 0x03FE43437EB9D9424 # 0.631374321162 901
+ .quad 0x03FE43437EB9D9424
+ .quad 0x03FE437FACCD31C10 # 0.631833457993 902
+ .quad 0x03FE437FACCD31C10
+ .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903
+ .quad 0x03FE43BBE1F42FE09
+ .quad 0x03FE43F81E307DE5E # 0.632752364559 904
+ .quad 0x03FE43F81E307DE5E
+ .quad 0x03FE445285D68EA69 # 0.633442099038 905
+ .quad 0x03FE445285D68EA69
+ .quad 0x03FE448ED3CF71355 # 0.633902186463 906
+ .quad 0x03FE448ED3CF71355
+ .quad 0x03FE44CB28E37C3EE # 0.634362485666 907
+ .quad 0x03FE44CB28E37C3EE
+ .quad 0x03FE450785145CAFE # 0.634822996841 908
+ .quad 0x03FE450785145CAFE
+ .quad 0x03FE45621CB769366 # 0.635514161481 909
+ .quad 0x03FE45621CB769366
+ .quad 0x03FE459E8AB7B799D # 0.635975203444 910
+ .quad 0x03FE459E8AB7B799D
+ .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911
+ .quad 0x03FE45DAFFDABD4DB
+ .quad 0x03FE46177C2229EC0 # 0.636897925539 912
+ .quad 0x03FE46177C2229EC0
+ .quad 0x03FE467243F53F69E # 0.637590526283 913
+ .quad 0x03FE467243F53F69E
+ .quad 0x03FE46AED21F117FC # 0.638052526753 914
+ .quad 0x03FE46AED21F117FC
+ .quad 0x03FE46EB677335D13 # 0.638514740766 915
+ .quad 0x03FE46EB677335D13
+ .quad 0x03FE472803F35EAAE # 0.638977168520 916
+ .quad 0x03FE472803F35EAAE
+ .quad 0x03FE4764A7A13EF3B # 0.639439810212 917
+ .quad 0x03FE4764A7A13EF3B
+ .quad 0x03FE47BFAA9F80271 # 0.640134174319 918
+ .quad 0x03FE47BFAA9F80271
+ .quad 0x03FE47FC60471DAF8 # 0.640597351724 919
+ .quad 0x03FE47FC60471DAF8
+ .quad 0x03FE48391D226992D # 0.641060743762 920
+ .quad 0x03FE48391D226992D
+ .quad 0x03FE4875E1331971E # 0.641524350631 921
+ .quad 0x03FE4875E1331971E
+ .quad 0x03FE48D114D3FB884 # 0.642220164181 922
+ .quad 0x03FE48D114D3FB884
+ .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923
+ .quad 0x03FE490DEAF1A3FC8
+ .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924
+ .quad 0x03FE494AC84AB0ED3
+ .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925
+ .quad 0x03FE4987ACE0DABB0
+ .quad 0x03FE49C498B5DA63F # 0.644078037452 926
+ .quad 0x03FE49C498B5DA63F
+ .quad 0x03FE4A20080EF10B2 # 0.644775630783 927
+ .quad 0x03FE4A20080EF10B2
+ .quad 0x03FE4A5D060894B8C # 0.645240963504 928
+ .quad 0x03FE4A5D060894B8C
+ .quad 0x03FE4A9A0B471A943 # 0.645706512861 929
+ .quad 0x03FE4A9A0B471A943
+ .quad 0x03FE4AD717CC3E626 # 0.646172279055 930
+ .quad 0x03FE4AD717CC3E626
+ .quad 0x03FE4B142B99BC871 # 0.646638262288 931
+ .quad 0x03FE4B142B99BC871
+ .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932
+ .quad 0x03FE4B6FD6F970C1F
+ .quad 0x03FE4BACFD036D080 # 0.647804171246 933
+ .quad 0x03FE4BACFD036D080
+ .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934
+ .quad 0x03FE4BEA2A5BDBE87
+ .quad 0x03FE4C275F047C956 # 0.648737878130 935
+ .quad 0x03FE4C275F047C956
+ .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936
+ .quad 0x03FE4C649AFF0EE16
+ .quad 0x03FE4CC082B46485A # 0.649906239052 937
+ .quad 0x03FE4CC082B46485A
+ .quad 0x03FE4CFDD1037E37C # 0.650373965908 938
+ .quad 0x03FE4CFDD1037E37C
+ .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939
+ .quad 0x03FE4D3B26AAADDD9
+ .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940
+ .quad 0x03FE4D7883ABB61F6
+ .quad 0x03FE4DB5E8085A477 # 0.651778460521 941
+ .quad 0x03FE4DB5E8085A477
+ .quad 0x03FE4DF353C25E42B # 0.652247064091 942
+ .quad 0x03FE4DF353C25E42B
+ .quad 0x03FE4E4F832C560DD # 0.652950381434 943
+ .quad 0x03FE4E4F832C560DD
+ .quad 0x03FE4E8D015786F16 # 0.653419534621 944
+ .quad 0x03FE4E8D015786F16
+ .quad 0x03FE4ECA86E64A683 # 0.653888908016 945
+ .quad 0x03FE4ECA86E64A683
+ .quad 0x03FE4F0813DA673DD # 0.654358501826 946
+ .quad 0x03FE4F0813DA673DD
+ .quad 0x03FE4F45A835A4E19 # 0.654828316258 947
+ .quad 0x03FE4F45A835A4E19
+ .quad 0x03FE4F8343F9CB678 # 0.655298351519 948
+ .quad 0x03FE4F8343F9CB678
+ .quad 0x03FE4FDFBB88A119A # 0.656003818920 949
+ .quad 0x03FE4FDFBB88A119A
+ .quad 0x03FE501D69DADD660 # 0.656474407164 950
+ .quad 0x03FE501D69DADD660
+ .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951
+ .quad 0x03FE505B1F9C43ED7
+ .quad 0x03FE5098DCCE9FABA # 0.657416248534 952
+ .quad 0x03FE5098DCCE9FABA
+ .quad 0x03FE50D6A173BC425 # 0.657887502077 953
+ .quad 0x03FE50D6A173BC425
+ .quad 0x03FE51146D8D65F98 # 0.658358977805 954
+ .quad 0x03FE51146D8D65F98
+ .quad 0x03FE5152411D69C03 # 0.658830675927 955
+ .quad 0x03FE5152411D69C03
+ .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956
+ .quad 0x03FE51AF0C774A2D0
+ .quad 0x03FE51ECF2B713F8A # 0.660010895584 957
+ .quad 0x03FE51ECF2B713F8A
+ .quad 0x03FE522AE0738A3D8 # 0.660483373741 958
+ .quad 0x03FE522AE0738A3D8
+ .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959
+ .quad 0x03FE5268D5AE7CDCB
+ .quad 0x03FE52A6D269BC600 # 0.661429000289 960
+ .quad 0x03FE52A6D269BC600
+ .quad 0x03FE52E4D6A719F9B # 0.661902149103 961
+ .quad 0x03FE52E4D6A719F9B
+ .quad 0x03FE5322E26867857 # 0.662375521893 962
+ .quad 0x03FE5322E26867857
+ .quad 0x03FE53800225BA6E2 # 0.663086001497 963
+ .quad 0x03FE53800225BA6E2
+ .quad 0x03FE53BE20B8DA502 # 0.663559935155 964
+ .quad 0x03FE53BE20B8DA502
+ .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965
+ .quad 0x03FE53FC46D64DDD1
+ .quad 0x03FE543A747FE9ED6 # 0.664508476843 966
+ .quad 0x03FE543A747FE9ED6
+ .quad 0x03FE5478A9B78404C # 0.664983085300 967
+ .quad 0x03FE5478A9B78404C
+ .quad 0x03FE54B6E67EF251C # 0.665457919117 968
+ .quad 0x03FE54B6E67EF251C
+ .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969
+ .quad 0x03FE54F52AD80BAE9
+ .quad 0x03FE553376C4A7A16 # 0.666408263689 970
+ .quad 0x03FE553376C4A7A16
+ .quad 0x03FE5571CA469E5C9 # 0.666883774872 971
+ .quad 0x03FE5571CA469E5C9
+ .quad 0x03FE55CF55C5A5437 # 0.667597465874 972
+ .quad 0x03FE55CF55C5A5437
+ .quad 0x03FE560DBC45153C7 # 0.668073543008 973
+ .quad 0x03FE560DBC45153C7
+ .quad 0x03FE564C2A6059FE7 # 0.668549846899 974
+ .quad 0x03FE564C2A6059FE7
+ .quad 0x03FE568AA0194EC6E # 0.669026377763 975
+ .quad 0x03FE568AA0194EC6E
+ .quad 0x03FE56C91D71CF810 # 0.669503135817 976
+ .quad 0x03FE56C91D71CF810
+ .quad 0x03FE5707A26BB8C66 # 0.669980121278 977
+ .quad 0x03FE5707A26BB8C66
+ .quad 0x03FE57462F08E7DF5 # 0.670457334363 978
+ .quad 0x03FE57462F08E7DF5
+ .quad 0x03FE5784C34B3AC30 # 0.670934775289 979
+ .quad 0x03FE5784C34B3AC30
+ .quad 0x03FE57C35F3490183 # 0.671412444273 980
+ .quad 0x03FE57C35F3490183
+ .quad 0x03FE580202C6C7353 # 0.671890341535 981
+ .quad 0x03FE580202C6C7353
+ .quad 0x03FE5840AE03C0204 # 0.672368467291 982
+ .quad 0x03FE5840AE03C0204
+ .quad 0x03FE589EBD437CA31 # 0.673086084831 983
+ .quad 0x03FE589EBD437CA31
+ .quad 0x03FE58DD7BB392B30 # 0.673564782782 984
+ .quad 0x03FE58DD7BB392B30
+ .quad 0x03FE591C41D500163 # 0.674043709994 985
+ .quad 0x03FE591C41D500163
+ .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986
+ .quad 0x03FE595B0FA9A7EF1
+ .quad 0x03FE5999E5336E121 # 0.675002253082 987
+ .quad 0x03FE5999E5336E121
+ .quad 0x03FE59D8C2743705E # 0.675481869398 988
+ .quad 0x03FE59D8C2743705E
+ .quad 0x03FE5A17A76DE803B # 0.675961715857 989
+ .quad 0x03FE5A17A76DE803B
+ .quad 0x03FE5A56942266F7B # 0.676441792678 990
+ .quad 0x03FE5A56942266F7B
+ .quad 0x03FE5A9588939A810 # 0.676922100084 991
+ .quad 0x03FE5A9588939A810
+ .quad 0x03FE5AD484C369F2D # 0.677402638296 992
+ .quad 0x03FE5AD484C369F2D
+ .quad 0x03FE5B1388B3BD53E # 0.677883407536 993
+ .quad 0x03FE5B1388B3BD53E
+ .quad 0x03FE5B5294667D5F7 # 0.678364408027 994
+ .quad 0x03FE5B5294667D5F7
+ .quad 0x03FE5B91A7DD93852 # 0.678845639990 995
+ .quad 0x03FE5B91A7DD93852
+ .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996
+ .quad 0x03FE5BD0C31AE9E9D
+ .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997
+ .quad 0x03FE5C2F7A8ED5E5B
+ .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998
+ .quad 0x03FE5C6EA94431EF9
+ .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999
+ .quad 0x03FE5CADDFC6874F5
+ .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000
+ .quad 0x03FE5CED1E17C35C6
+ .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001
+ .quad 0x03FE5D2C6439D4252
+ .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002
+ .quad 0x03FE5D6BB22EA86F6
+ .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003
+ .quad 0x03FE5DAB07F82FB84
+ .quad 0x03FE5DEA65985A350 # 0.683428931091 1004
+ .quad 0x03FE5DEA65985A350
+ .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005
+ .quad 0x03FE5E29CB1118D32
+ .quad 0x03FE5E6938645D390 # 0.684396517040 1006
+ .quad 0x03FE5E6938645D390
+ .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007
+ .quad 0x03FE5EA8AD9419C5B
+ .quad 0x03FE5EE82AA241920 # 0.685365040118 1008
+ .quad 0x03FE5EE82AA241920
+ .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009
+ .quad 0x03FE5F27AF90C8705
+ .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010
+ .quad 0x03FE5F673C61A2ED2
+ .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011
+ .quad 0x03FE5FA6D116C64F7
+ .quad 0x03FE5FE66DB228992 # 0.687304904936 1012
+ .quad 0x03FE5FE66DB228992
+ .quad 0x03FE60261235C0874 # 0.687790459692 1013
+ .quad 0x03FE60261235C0874
+ .quad 0x03FE6065BEA385926 # 0.688276250325 1014
+ .quad 0x03FE6065BEA385926
+ .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015
+ .quad 0x03FE60A572FD6FEF1
+ .quad 0x03FE60E52F45788E4 # 0.689248540144 1016
+ .quad 0x03FE60E52F45788E4
+ .quad 0x03FE6124F37D991D4 # 0.689735039789 1017
+ .quad 0x03FE6124F37D991D4
+ .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018
+ .quad 0x03FE6164BFA7CC06C
+ .quad 0x03FE61A493C60C729 # 0.690708749700 1019
+ .quad 0x03FE61A493C60C729
+ .quad 0x03FE61E46FDA56466 # 0.691195960429 1020
+ .quad 0x03FE61E46FDA56466 + .quad 0x03FE622453E6A6263 # 0.691683408647 1021 + .quad 0x03FE622453E6A6263 + .quad 0x03FE62643FECF9743 # 0.692171094587 1022 + .quad 0x03FE62643FECF9743 + .quad 0x03FE62A433EF4E51A # 0.692659018480 1023 + .quad 0x03FE62A433EF4E51A
diff --git a/src/gas/vrdacos.S b/src/gas/vrdacos.S new file mode 100644 index 0000000..5e2b3a4 --- /dev/null +++ b/src/gas/vrdacos.S
@@ -0,0 +1,3118 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrdacos.s +# +# An array implementation of the cos libm function. +# +# Prototype: +# +# void vrda_cos(int n, double *x, double *y); +# +#Computes Cosine of x for an array of input values. +#Places the results into the supplied y array. +#Does not perform error checking. +#Denormal inputs may produce unexpected results +#Author: Harsha Jagasia +#Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 16 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03fa5555555555555 + .quad 0x0bf56c16c16c16967 # -0.00138889 c2 + .quad 0x0bf56c16c16c16967 + .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3 + .quad 0x03efa01a019f4ec90 + .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4 + .quad 0x0be927e4fa17f65f6 + .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5 + .quad 0x03e21eeb69037ab78 + .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6 + .quad 
0x0bda907db46cc5e42 +.Lsinarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bfc5555555555555 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03f81111111110bb3 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0bf2a01a019e83e5c + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03ec71de3796cde01 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0be5ae600b42fdfa7 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x03de5e0b2f9a43bb8 +.Lsincosarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x0bf56c16c16c16967 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x03efa01a019f4ec90 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x0be927e4fa17f65f6 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x03e21eeb69037ab78 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x0bda907db46cc5e42 +.Lcossinarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bf56c16c16c16967 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03efa01a019f4ec90 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0be927e4fa17f65f6 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03e21eeb69037ab78 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0bda907db46cc5e42 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + +.align 16 +.Levencos_oddsin_tbl: + .quad .Lcoscos_coscos_piby4 # 0 * + .quad .Lcoscos_cossin_piby4 # 1 + + .quad .Lcoscos_sincos_piby4 # 2 + .quad .Lcoscos_sinsin_piby4 # 3 + + + .quad .Lcossin_coscos_piby4 # 4 + .quad .Lcossin_cossin_piby4 # 5 * + .quad .Lcossin_sincos_piby4 # 6 + .quad .Lcossin_sinsin_piby4 # 7 + + .quad .Lsincos_coscos_piby4 # 8 + .quad .Lsincos_cossin_piby4 # 9 + .quad .Lsincos_sincos_piby4 # 10 * + .quad .Lsincos_sinsin_piby4 # 11 + + .quad .Lsinsin_coscos_piby4 # 12 + .quad .Lsinsin_cossin_piby4 # 13 + + .quad .Lsinsin_sincos_piby4 # 14 + .quad .Lsinsin_sinsin_piby4 # 15 * + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .weak vrda_cos_ + .set vrda_cos_,__vrda_cos__ + .weak vrda_cos__ + .set vrda_cos__,__vrda_cos__ + + .text + .align 16 + .p2align 4,,15 + +#x/* a FORTRAN subroutine implementation of array cos +#** VRDA_COS(N,X,Y) +# C equivalent*/ +#void vrda_cos__(int * n, double *x, double *y) +#{ +# vrda_cos(*n,x,y); +#} +.globl __vrda_cos__ + .type __vrda_cos__,@function +__vrda_cos__: + mov (%rdi),%edi + + .align 16 + .p2align 4,,15 + +# define local variable storage offsets +.equ p_temp, 0x00 # temporary for get/put bits operation +.equ p_temp1, 0x10 # temporary for get/put bits operation + +.equ p_xmm6, 0x20 # temporary for get/put bits operation +.equ p_xmm7, 0x30 # temporary for get/put bits operation +.equ p_xmm8, 0x40 # temporary for get/put bits operation +.equ p_xmm9, 0x50 # temporary for get/put bits operation +.equ p_xmm10, 0x60 # temporary for get/put bits operation +.equ p_xmm11, 0x70 # temporary for get/put bits operation +.equ p_xmm12, 0x80 # temporary for get/put bits operation +.equ p_xmm13, 0x90 # temporary for get/put bits operation +.equ p_xmm14, 0x0A0 # temporary for get/put bits operation +.equ p_xmm15, 0x0B0 # temporary for get/put bits operation + +.equ r, 0x0C0 # pointer to r for remainder_piby2 +.equ rr, 0x0D0 # pointer to r for remainder_piby2 +.equ region, 0x0E0 # pointer to r for remainder_piby2 + +.equ r1, 0x0F0 # pointer to r for remainder_piby2 +.equ rr1, 0x0100 # pointer to r for remainder_piby2 +.equ 
region1, 0x0110 # pointer to r for remainder_piby2 + +.equ p_temp2, 0x0120 # temporary for get/put bits operation +.equ p_temp3, 0x0130 # temporary for get/put bits operation + +.equ p_temp4, 0x0140 # temporary for get/put bits operation +.equ p_temp5, 0x0150 # temporary for get/put bits operation + +.equ p_original, 0x0160 # original x +.equ p_mask, 0x0170 # original x +.equ p_sign, 0x0180 # original x + +.equ p_original1, 0x0190 # original x +.equ p_mask1, 0x01A0 # original x +.equ p_sign1, 0x01B0 # original x + +.equ save_xa, 0x01C0 #qword +.equ save_ya, 0x01D0 #qword + +.equ save_nv, 0x01E0 #qword +.equ p_iter, 0x01F0 #qword storage for number of loop iterations + + +.globl vrda_cos + .type vrda_cos,@function +vrda_cos: +# parameters are passed in by Linux C as: +# edi - int n +# rsi - double *x +# rdx - double *y + + + sub $0x208,%rsp + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#START PROCESS INPUT +# save the arguments + mov %rsi,save_xa(%rsp) # save x_array pointer + mov %rdx,save_ya(%rsp) # save y_array pointer +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax + mov %rax,%rdi +#endif + mov %rdi,save_nv(%rsp) # save number of values + +# see if too few values to call the main loop + shr $2,%rax # get number of iterations + jz .L__vrda_cleanup # jump if only single calls + +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rdi # compute number of extra single calls + mov %rdi,save_nv(%rsp) # save number of left over values + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#START LOOP +.align 16 +.L__vrda_top: +# build the input _m128d + movapd .L__real_7fffffffffffffff(%rip),%xmm2 + mov save_xa(%rsp),%rsi # get x_array pointer + movlpd (%rsi),%xmm0 + movhpd 8(%rsi),%xmm0 + + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + movdqa %xmm0,p_original(%rsp) + movlpd -16(%rsi), %xmm1 + movhpd -8(%rsi), %xmm1 + movdqa %xmm1,p_original1(%rsp) + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + +andpd %xmm2,%xmm0 #Unsign +andpd %xmm2,%xmm1 #Unsign + +movd %xmm0,%rax #rax is lower arg +movhpd %xmm0, p_temp+8(%rsp) # +mov p_temp+8(%rsp),%rcx #rcx = upper arg +movd %xmm1,%r8 #r8 is lower arg +movhpd %xmm1, p_temp1+8(%rsp) # +mov p_temp1+8(%rsp),%r9 #r9 = upper arg + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + +movapd %xmm0,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm0,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + +#DEBUG +# add $0x1C8,%rsp +# ret +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm0 + mulpd %xmm0,%xmm2 # * twobypi + mulpd %xmm0,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd 
.L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. + +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%rax # Region + movd %xmm5,%rcx # Region + + mov %rax,%r8 + mov %rcx,%r9 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail + +# paddd .L__reald_one_one(%rip),%xmm4 ; Sign +# paddd .L__reald_one_one(%rip),%xmm5 ; Sign +# pand .L__reald_two_two(%rip),%xmm4 +# pand .L__reald_two_two(%rip),%xmm5 +# punpckldq %xmm4,%xmm4 +# punpckldq %xmm5,%xmm5 +# psllq $62,%xmm4 +# psllq $62,%xmm5 + + + add .L__reald_one_one(%rip),%r8 + add .L__reald_one_one(%rip),%r9 + and .L__reald_two_two(%rip),%r8 + and .L__reald_two_two(%rip),%r9 + + mov %r8,%r10 + mov %r9,%r11 + shl $62,%r8 + and .L__reald_two_zero(%rip),%r10 + shl $30,%r10 + shl $62,%r9 + and .L__reald_two_zero(%rip),%r11 + shl $30,%r11 + + mov %r8,p_sign(%rsp) + mov %r10,p_sign+8(%rsp) + mov %r9,p_sign1(%rsp) + mov %r11,p_sign1+8(%rsp) + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm0,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + and .L__reald_one_one(%rip),%rax # Region + and .L__reald_one_one(%rip),%rcx # Region + + subpd %xmm8,%xmm0 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + subpd %xmm0,%xmm6 #rr=rhead-r + subpd %xmm1,%xmm7 #rr=rhead-r + + mov %rax,%r8 + mov %rcx,%r9 + + movapd %xmm0,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail + subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail + + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + leaq .Levencos_oddsin_tbl(%rip),%rsi + jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation 
based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + +#DEBUG +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm10, xmm12 +# %xmm11,,%xmm9 xmm13 + + +#DEBUG +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm10 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm0 + subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm6 # rr=rhead-r + subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm0,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_lower_naninf + + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrd4_cos_lower_naninf: + mov p_original(%rsp),%rax # upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) # rr = 0 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + + +#DEBUG +# movapd 
.LOWORD,%xmm4 PTR r[rsp] +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + +#DEBUG +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + + + movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%rcx #Restore upper arg + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrd4_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + mov p_original(%rsp),%rax + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) #rr = 0 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrd4_cos_upper_naninf_of_both_gt_5e5: + mov p_original+8(%rsp),%rcx #upper arg is nan/inf +# movd %xmm6,%rcx ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) #rr = 0 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Restore xmm4 and %xmm3,,%xmm1 xmm7 +# Can use %xmm10,,%xmm8 xmm12 +# %xmm9,,%xmm5 xmm11, xmm13 + + movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call +# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoide exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case +# movlhps %xmm2,%xmm2 +# movlhps %xmm6,%xmm6 + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints + movsd 
.L__real_3dd0b4611a600000(%rip),%xmm10 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm10 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store upper region + movsd %xmm6,%xmm0 + subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm6 # rr=rhead-r + subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm0,r(%rsp) # store upper r + movlpd %xmm6,rr(%rsp) # store upper rr + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_upper_naninf + + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrd4_cos_upper_naninf: + mov p_original+8(%rsp),%rcx # upper arg is nan/inf +# mov r+8(%rsp),%rcx ; upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) # rr = 0 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcheck_next2_args: + +#DEBUG +# movapd r(%rsp),%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r8 + jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfirst_second_done_fourth_arg_gt_5e5 + +# Work on next two args, both < 5e5 +# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5 + + movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. 
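# The scalar path above follows its interleaved C pseudocode comments exactly.
# For reference, a C sketch of the same three-constant Cody-Waite reduction
# (constant values decoded from the .L__real_* quadwords at the top of this
# file; the function name is illustrative, and arguments at or above 5e5 go
# through __amd_remainder_piby2 instead):
#
#   static void reduce_piby2(double x, double *r, double *rr, int *npi2)
#   {
#       static const double twobypi     = 6.36619772367581382433e-01;
#       static const double piby2_1     = 1.57079632673412561417e+00;
#       static const double piby2_2     = 6.07710050630396597660e-11;
#       static const double piby2_2tail = 2.02226624879595063154e-21;
#
#       int n = (int)(x * twobypi + 0.5);
#       double rhead = x - n * piby2_1;  /* exact: piby2_1 has trailing zeros */
#       double rtail = n * piby2_2;
#       double t = rhead;
#       rhead = t - rtail;
#       rtail = n * piby2_2tail - ((t - rhead) - rtail);
#       *r    = rhead - rtail;           /* reduced argument                  */
#       *rr   = (rhead - *r) - rtail;    /* low-order tail of r               */
#       *npi2 = n;                       /* quadrant = n & 3, used below      */
#   }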
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm5,region1(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm1,%xmm7 # rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + + subpd %xmm1,%xmm7 # rr=rhead-r + subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail + movapd %xmm7,rr1(%rsp) + + jmp .L__vrd4_cos_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm10, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + +#DEBUG +# movapd %xmm0,%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
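# Each reduction path ends by indexing .Levencos_oddsin_tbl with the low bits
# of the four quadrants. Per element the dispatch reduces to: an even quadrant
# keeps the cosine polynomial, an odd quadrant uses the sine polynomial, and
# ((npi2 + 1) & 2) supplies the sign bit that .L__vrd4_cos_cleanup later
# applies with xorpd. A scalar C sketch (cos_kernel/sin_kernel are
# hypothetical stand-ins for the .Lcos*/.Lsin* polynomial paths; the libm
# calls are for illustration only):
#
#   #include <math.h>
#   #include <stdint.h>
#   #include <string.h>
#
#   static double cos_kernel(double r, double rr) { (void)rr; return cos(r); }
#   static double sin_kernel(double r, double rr) { (void)rr; return sin(r); }
#
#   /* cos(x) = +cos(r), -sin(r), -cos(r), +sin(r) for npi2 % 4 = 0,1,2,3 */
#   static double cos_from_quadrant(double r, double rr, int npi2)
#   {
#       double v = (npi2 & 1) ? sin_kernel(r, rr) : cos_kernel(r, rr);
#       uint64_t sign = (uint64_t)((npi2 + 1) & 2) << 62;   /* -> bit 63 */
#       uint64_t bits;
#       memcpy(&bits, &v, sizeof bits);
#       bits ^= sign;                     /* matches the xorpd in cleanup */
#       memcpy(&v, &bits, sizeof bits);
#       return v;
#   }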
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm4,region(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm0,%xmm6 # rhead + subpd %xmm8,%xmm0 # r = rhead - rtail + movapd %xmm0,r(%rsp) + + subpd %xmm0,%xmm6 # rr=rhead-r + subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail + movapd %xmm6,rr(%rsp) + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + +#DEBUG +# movapd r(%rsp),%xmm4 +# movapd %xmm1,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + movsd %xmm7,%xmm1 + subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail) + subsd %xmm1,%xmm7 # rr=rhead-r + subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail) + movlpd %xmm1,r1+8(%rsp) # store upper r + movlpd %xmm7,rr1+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg 
nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_cos_lower_naninf_higher: + mov p_original1(%rsp),%r8 # upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) # rr = 0 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + + +#DEBUG +# movapd rr(%rsp),%xmm4 +# movapd rr1(%rsp),%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + jmp .L__vrd4_cos_reconstruct + + + + + + + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + +#DEBUG +# movapd r(%rsp),%xmm4 +# movd %r8,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movsd %xmm1,%xmm0 + call __amd_remainder_piby2@PLT + mov p_temp1(%rsp),%r9 #Restore upper arg + + +#DEBUG +# movapd r(%rsp),%xmm4 +# mov QWORD PTR r1[rsp+8], r9 +# movapd r1(%rsp),%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + + jmp 0f + +.L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov p_original1(%rsp),%r8 + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) #rr = 0 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr1+8(%rsp),%rsi + lea r1+8(%rsp),%rdi + movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher: + mov p_original1+8(%rsp),%r9 #upper arg is nan/inf +# movd %xmm6,%r9 ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) #rr = 0 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + +#DEBUG +# movapd r(%rsp),%xmm4 +# movapd r1(%rsp),%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + + jmp .L__vrd4_cos_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
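# Every huge-argument path above screens for NaN/infinity before calling
# __amd_remainder_piby2: a biased exponent field of all ones (0x7ff) means the
# reduction is skipped and the input is passed through with the quiet-NaN bit
# (0x0008000000000000) forced on, while rr and region are zeroed. A C sketch
# of that screen (the function name is illustrative):
#
#   #include <stdint.h>
#   #include <string.h>
#
#   static int screen_naninf(double x, double *r)
#   {
#       uint64_t bits;
#       memcpy(&bits, &x, sizeof bits);
#       if ((bits & 0x7ff0000000000000ull) != 0x7ff0000000000000ull)
#           return 0;                       /* finite: caller reduces x   */
#       bits |= 0x0008000000000000ull;      /* quiet NaN propagates out   */
#       memcpy(r, &bits, sizeof bits);
#       return 1;                           /* caller sets rr=0, region=0 */
#   }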
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm4,region(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm0,%xmm6 # rhead + subpd %xmm8,%xmm0 # r = rhead - rtail + movapd %xmm0,r(%rsp) + + subpd %xmm0,%xmm6 # rr=rhead-r + subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail + movapd %xmm6,rr(%rsp) + + +# Work on next two args, third arg < 5e5, fourth arg >= 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_fourth_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call +# movlhps %xmm1,%xmm1 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case +# movlhps %xmm3,%xmm3 +# movlhps %xmm7,%xmm7 + movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r8d,region1(%rsp) # store lower region + movsd %xmm7,%xmm1 + subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail) + subsd %xmm1,%xmm7 # rr=rhead-r + subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail) + + movlpd %xmm1,r1(%rsp) # store lower r + movlpd %xmm7,rr1(%rsp) # store lower rr + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrd4_cos_upper_naninf_higher + + lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea rr1+8(%rsp),%rsi + lea r1+8(%rsp),%rdi + movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + 
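# Ahead of the polynomial paths below: the even-quadrant kernels (.Lcoscos_*)
# evaluate cos(r + rr) as t + (x4*zc + (((1 + (-t)) - 0.5*x2) - r*rr)) with
# t = 1 - 0.5*x2, which keeps full accuracy around the leading 1.0 term. A
# scalar C sketch with the .Lcosarray coefficients decoded from their
# quadwords (the function name is illustrative):
#
#   static const double c1 =  0x1.5555555555555p-5;   /* 0x3FA5555555555555 */
#   static const double c2 = -0x1.6C16C16C16967p-10;  /* 0xBF56C16C16C16967 */
#   static const double c3 =  0x1.A01A019F4EC90p-16;  /* 0x3EFA01A019F4EC90 */
#   static const double c4 = -0x1.27E4FA17F65F6p-22;  /* 0xBE927E4FA17F65F6 */
#   static const double c5 =  0x1.1EEB69037AB78p-29;  /* 0x3E21EEB69037AB78 */
#   static const double c6 = -0x1.907DB46CC5E42p-37;  /* 0xBDA907DB46CC5E42 */
#
#   /* cos(r + rr) for |r| <= pi/4, in the asm's evaluation order: zc is a
#      low part (c1..c3) plus a high part (c4..c6) scaled by x6, then the
#      result is assembled around t = 1 - 0.5*x2. */
#   static double cos_piby4_kern(double r, double rr)
#   {
#       double x2 = r * r, x4 = x2 * x2, x6 = x4 * x2;
#       double zc = (c1 + x2 * (c2 + x2 * c3))
#                 + x6 * (c4 + x2 * (c5 + x2 * c6));
#       double q = 0.5 * x2;
#       double t = 1.0 - q;
#       return t + (x4 * zc + (((1.0 - t) - q) - r * rr));
#   }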
+.L__vrd4_cos_upper_naninf_higher: + mov p_original1+8(%rsp),%r9 # upper arg is nan/inf +# mov r1+8(%rsp),%r9 # upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) # rr = 0 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrd4_cos_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_cos_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +#DEBUG +# movapd region(%rsp),%xmm4 +# movapd region1(%rsp),%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + movapd r(%rsp),%xmm0 + movapd r1(%rsp),%xmm1 + + movapd rr(%rsp),%xmm6 + movapd rr1(%rsp),%xmm7 + + mov region(%rsp),%rax + mov region1(%rsp),%rcx + + mov %rax,%r8 + mov %rcx,%r9 + + add .L__reald_one_one(%rip),%r8 + add .L__reald_one_one(%rip),%r9 + and .L__reald_two_two(%rip),%r8 + and .L__reald_two_two(%rip),%r9 + + mov %r8,%r10 + mov %r9,%r11 + shl $62,%r8 + and .L__reald_two_zero(%rip),%r10 + shl $30,%r10 + shl $62,%r9 + and .L__reald_two_zero(%rip),%r11 + shl $30,%r11 + + mov %r8,p_sign(%rsp) + mov %r10,p_sign+8(%rsp) + mov %r9,p_sign1(%rsp) + mov %r11,p_sign1+8(%rsp) + + and .L__reald_one_one(%rip),%rax # Region + and .L__reald_one_one(%rip),%rcx # Region + + mov %rax,%r8 + mov %rcx,%r9 + + movapd %xmm0,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + +#DEBUG +# movd %rax,%xmm4 +# movd %rax,%xmm5 +# xorpd %xmm0,%xmm0 +# xorpd %xmm1,%xmm1 +# jmp .L__vrd4_cos_cleanup +#DEBUG + + leaq .Levencos_oddsin_tbl(%rip),%rsi + jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_cos_cleanup: + + movapd p_sign(%rsp), %xmm0 + movapd p_sign1(%rsp),%xmm1 + + xorpd %xmm4,%xmm0 # (+) Sign + xorpd %xmm5,%xmm1 # (+) Sign + +.L__vrda_bottom1: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + +.L__vrda_bottom2: + prefetch 64(%rdi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movlpd %xmm1, -16(%rdi) + movhpd %xmm1, -8(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vrda_top + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vrda_cleanup + +.L__final_check: + add $0x208,%rsp + ret + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# we jump here when we have an odd number of cos calls to make at the end +# we assume that rdx is pointing at the next x array element, r8 at the next y array element. 
+# The number of values left is in save_nv + +.align 16 +.L__vrda_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__final_check # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + +# fill in a m128d with zeroes and the extra values and then make a recursive call. + xorpd %xmm0,%xmm0 + movlpd %xmm0,p_temp+8(%rsp) + movapd %xmm0,p_temp+16(%rsp) + + mov (%rsi),%rcx # we know there's at least one + mov %rcx,p_temp(%rsp) + cmp $2,%rax + jl .L__vrdacg + + mov 8(%rsi),%rcx # do the second value + mov %rcx,p_temp+8(%rsp) + cmp $3,%rax + jl .L__vrdacg + + mov 16(%rsi),%rcx # do the third value + mov %rcx,p_temp+16(%rsp) + +.L__vrdacg: + mov $4,%rdi # parameter for N + lea p_temp(%rsp),%rsi # &x parameter + lea p_temp2(%rsp),%rdx # &y parameter + call vrda_cos@PLT + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + mov p_temp2(%rsp),%rcx + mov %rcx, (%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vrdacgf + + mov p_temp2+8(%rsp),%rcx + mov %rcx, 8(%rdi) # do the second value + cmp $3,%rax + jl .L__vrdacgf + + mov p_temp2+16(%rsp),%rcx + mov %rcx, 16(%rdi) # do the third value + +.L__vrdacgf: + jmp .L__final_check + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + + addpd 
.Lcosarray+0x10(%rip),%xmm8 # c2+x2c3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2c3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + + mulpd %xmm2,%xmm8 # x2(c2+x2c3) + mulpd %xmm3,%xmm9 # x2(c2+x2c3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2c3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2c3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm2 # x4 recalculate + mulpd %xmm3,%xmm3 # x4 recalculate + + movapd p_temp2(%rsp),%xmm12 # r + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculate, -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculate, -t = r-1 + + subpd %xmm12,%xmm4 # + t + subpd %xmm13,%xmm5 # + t + + jmp .L__vrd4_cos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcossin_cossin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # s3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm10,%xmm10 # move high x4 for cos term + movhlps %xmm11,%xmm11 # move high x4 for cos term + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd %xmm0,%xmm6 # get low x3 for sin 
term + mulsd %xmm1,%xmm7 # get low x3 for sin term + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos + movhlps %xmm3,%xmm13 # move high r for cos + movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos + movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm7,%xmm5 # sin *x3 + mulsd %xmm10,%xmm8 # cos *x4 + mulsd %xmm11,%xmm9 # cos *x4 + + mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term + mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term + movsd %xmm12,%xmm6 # Keep high r for cos term + movsd %xmm13,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0 + + subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx + subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x*xx for cos term + movhlps %xmm1,%xmm11 # move high x for x*xx for cos term + + mulsd p_temp+8(%rsp),%xmm10 # x * xx + mulsd p_temp1+8(%rsp),%xmm11 # x * xx + + movsd %xmm12,%xmm2 # move -t for cos term + movsd %xmm13,%xmm3 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t) + addsd p_temp(%rsp),%xmm4 # sin+xx + addsd p_temp1(%rsp),%xmm5 # sin+xx + subsd %xmm6,%xmm12 # (1-t) - r + subsd %xmm7,%xmm13 # (1-t) - r + subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx + subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx + addsd %xmm0,%xmm4 # sin + x + addsd %xmm1,%xmm5 # sin + x + addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx) + addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx) + subsd %xmm2,%xmm8 # cos+t + subsd %xmm3,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_cos_cleanup + +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lsincos_cossin_piby4: # changed from sincos_sincos + # xmm1 is cossin and xmm0 is sincos + + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term + + movapd .Lsincosarray+0x50(%rip),%xmm4 # s6 + movapd .Lcossinarray+0x50(%rip),%xmm5 # s6 + movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3 + movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + 
movhlps %xmm10,%xmm10 # move high x4 for cos term + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin) + movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos) + + mulsd %xmm0,%xmm6 # get low x3 for sin term + mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos (cossin) + movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos) + + movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin) + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos) + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm11,%xmm5 # cos *x4 + mulsd %xmm10,%xmm8 # cos *x4 + mulsd %xmm7,%xmm9 # sin *x3 + + mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin) + mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos) + + movsd %xmm12,%xmm6 # Keep high r for cos term + movsd %xmm3,%xmm7 # Keep low r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin) + subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos) + + movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin) + movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos) + + mulsd p_temp+8(%rsp),%xmm10 # x * xx + mulsd p_temp1(%rsp),%xmm1 # x * xx + + movsd %xmm12,%xmm2 # move -t for cos term + movsd %xmm3,%xmm13 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t) + + addsd p_temp(%rsp),%xmm4 # sin+xx + + addsd p_temp1+8(%rsp),%xmm9 # sin+xx + + + subsd %xmm6,%xmm12 # (1-t) - r + subsd %xmm7,%xmm3 # (1-t) - r + + subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx + subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx + + addsd %xmm0,%xmm4 # sin + x + + addsd %xmm11,%xmm9 # sin + x + + + addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx) + addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx) + + subsd %xmm2,%xmm8 # cos+t + subsd %xmm13,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lsincos_sincos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm0,p_temp2(%rsp) # Store r + movapd %xmm1,p_temp3(%rsp) # Store r + + + movapd .Lcossinarray+0x50(%rip),%xmm4 # s6 + movapd .Lcossinarray+0x50(%rip),%xmm5 # s6 + movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3 + movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 
for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term + movhlps %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term + mulsd p_temp3+8(%rsp),%xmm7 # get low x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term + movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term + # Reverse 12 and 2 + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos + + mulsd %xmm6,%xmm8 # sin *x3 + mulsd %xmm7,%xmm9 # sin *x3 + mulsd %xmm10,%xmm4 # cos *x4 + mulsd %xmm11,%xmm5 # cos *x4 + + mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term + mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term + movsd %xmm2,%xmm6 # Keep high r for cos term + movsd %xmm3,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx + subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x for sin term + # Reverse 10 and 0 + + mulsd p_temp(%rsp),%xmm0 # x * xx + mulsd p_temp1(%rsp),%xmm1 # x * xx + + movsd %xmm2,%xmm12 # move -t for cos term + movsd %xmm3,%xmm13 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t) + addsd p_temp+8(%rsp),%xmm8 # sin+xx + addsd p_temp1+8(%rsp),%xmm9 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm7,%xmm3 # (1-t) - r + + subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx + subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm8 # sin + x + addsd %xmm11,%xmm9 # sin + x + + addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx) + addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx) + + subsd %xmm12,%xmm4 # cos+t + subsd %xmm13,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lcossin_sincos_piby4: # changed from sincos_sincos + # xmm1 is cossin and xmm0 is sincos +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm0,p_temp2(%rsp) # Store r + + + movapd .Lcossinarray+0x50(%rip),%xmm4 # s6 + movapd .Lsincosarray+0x50(%rip),%xmm5 # s6 + movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3 + movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # 
x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm11,%xmm11 # move high x4 for cos term + + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + + mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term + movhlps %xmm3,%xmm13 # move high r for cos + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin + + mulsd %xmm6,%xmm8 # sin *x3 + mulsd %xmm11,%xmm9 # cos *x4 + mulsd %xmm10,%xmm4 # cos *x4 + mulsd %xmm7,%xmm5 # sin *x3 + + mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term + mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term + + movsd %xmm2,%xmm6 # Keep high r for cos term + movsd %xmm13,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0 + + subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx + subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x*xx for cos term + + mulsd p_temp(%rsp),%xmm0 # x * xx + mulsd p_temp1+8(%rsp),%xmm11 # x * xx + + movsd %xmm2,%xmm12 # move -t for cos term + movsd %xmm13,%xmm3 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t) + + addsd p_temp+8(%rsp),%xmm8 # sin+xx + addsd p_temp1(%rsp),%xmm5 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm7,%xmm13 # (1-t) - r + + subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx + subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx + + + addsd %xmm10,%xmm8 # sin + x + addsd %xmm1,%xmm5 # sin + x + + addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx) + addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx) + + subsd %xmm12,%xmm4 # cos+t + subsd %xmm3,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_cos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_sinsin_piby4: + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,p_temp2(%rsp) # store x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + movapd %xmm11,p_temp3(%rsp) # store r + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + mulpd 
%xmm2,%xmm10 # x4 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm0,%xmm2 # x3 recalculate + mulpd %xmm3,%xmm3 # x4 recalculate + + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm6,%xmm12 # 0.5 * x2 *xx + mulpd %xmm1,%xmm7 # x * xx + + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm12,%xmm4 # -0.5 * x2 *xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm6,%xmm4 # x3 * zs +xx + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculate, -t = r-1 + addpd %xmm0,%xmm4 # +x + subpd %xmm13,%xmm5 # + t + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lsinsin_coscos_piby4: + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + movapd %xmm3,p_temp3(%rsp) # store x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + movapd %xmm10,p_temp2(%rsp) # store r + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + mulpd %xmm3,%xmm11 # x4 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm2 # x4 recalculate + mulpd %xmm1,%xmm3 # x3 recalculate + + movapd p_temp2(%rsp),%xmm12 # r + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm7,%xmm13 # 0.5 * x2 *xx + subpd %xmm12,%xmm10 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x3 * zs + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r)
- x*xx + subpd %xmm13,%xmm5 # -0.5 * x2 *xx + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm7,%xmm5 # +xx + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculate, -t = r-1 + addpd %xmm1,%xmm5 # +x + subpd %xmm12,%xmm4 # + t + + jmp .L__vrd4_cos_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_cossin_piby4: # Derived from cossin_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + movhlps %xmm10,%xmm10 # get upper r for t for cos + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + movsd %xmm0,%xmm8 # lower x for sin + mulsd %xmm2,%xmm8 # lower x3 for sin + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # upper x4 for cos + movsd %xmm8,%xmm2 # lower x3 for sin + + movsd %xmm6,%xmm9 # lower xx + # note using odd reg + + movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm0,%xmm6 # x * xx for upper cos term + mulpd %xmm1,%xmm7 # x * xx + movhlps %xmm6,%xmm6 + mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + # x3 * zs + + movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin + + subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp(%rsp),%xmm4 # +xx + + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subsd %xmm12,%xmm8 # + t + addsd %xmm0,%xmm4 # +x + subpd %xmm13,%xmm5 # + t + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lcoscos_sincos_piby4: # Derived from sincos_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa
.Lcossinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lcossinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zszc + addpd %xmm9,%xmm5 # z + + mulpd %xmm0,%xmm2 # upper x3 for sin + mulsd %xmm0,%xmm2 # lower x4 for cos + mulpd %xmm3,%xmm3 # x4 + + movhlps %xmm6,%xmm9 # upper xx for sin term + # note using odd reg + + movlpd p_temp2(%rsp),%xmm12 # lower r for cos term + movapd p_temp3(%rsp),%xmm13 # r + + + mulpd %xmm0,%xmm6 # x * xx for lower cos term + mulpd %xmm1,%xmm7 # x * xx + + mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + mulpd %xmm3,%xmm5 + # x4 * zc + + movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos + subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx + + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp+8(%rsp),%xmm8 # +xx + + movhlps %xmm0,%xmm0 # upper x for sin + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subsd %xmm12,%xmm4 # + t + subpd %xmm13,%xmm5 # + t + addsd %xmm0,%xmm8 # +x + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lcossin_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + movhlps %xmm11,%xmm11 # get upper r for t for cos + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd 
.L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zcs + + movsd %xmm1,%xmm9 # lower x for sin + mulsd %xmm3,%xmm9 # lower x3 for sin + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # upper x4 for cos + movsd %xmm9,%xmm3 # lower x3 for sin + + movsd %xmm7,%xmm8 # lower xx + # note using even reg + + movapd p_temp2(%rsp),%xmm12 # r + movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx for upper cos term + movhlps %xmm7,%xmm7 + mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + # x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin + + subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1(%rsp),%xmm5 # +xx + + + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculate, -t = r-1 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculate, -t = r-1 + + subpd %xmm12,%xmm4 # + t + subsd %xmm13,%xmm9 # + t + addsd %xmm1,%xmm5 # +x + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lcossin_sinsin_piby4: # Derived from sincos_sinsin + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + movhlps %xmm11,%xmm11 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm6,%xmm10 # 0.5*x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 #
x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zczs + + movsd %xmm3,%xmm12 + mulsd %xmm1,%xmm12 # low x3 for sin + + mulpd %xmm0,%xmm2 # x3 + mulpd %xmm3,%xmm3 # high x4 for cos + movsd %xmm12,%xmm3 # low x3 for sin + + movhlps %xmm1,%xmm8 # upper x for cos term + # note using even reg + movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term + + mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term + + mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term + + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin + + subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx + + subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx + + subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx + addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1(%rsp),%xmm5 # +xx + + addpd %xmm6,%xmm4 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + + addsd %xmm1,%xmm5 # +x + addpd %xmm0,%xmm4 # +x + subsd %xmm13,%xmm9 # + t + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lsincos_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcossinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm7,%xmm8 # upper xx for sin term + # note using even reg + + movapd p_temp2(%rsp),%xmm12 # r + movlpd p_temp3(%rsp),%xmm13 # 
lower r for cos term + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx for lower cos term + + mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos + + subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1+8(%rsp),%xmm9 # +xx + + movhlps %xmm1,%xmm1 # upper x for sin + subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subpd %xmm12,%xmm4 # + t + subsd %xmm13,%xmm5 # + t + addsd %xmm1,%xmm9 # +x + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_cos_cleanup + + +.align 16 +.Lsincos_sinsin_piby4: # Derived from sincos_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcossinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm6,%xmm10 # 0.5x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm0,%xmm2 # x3 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm7,%xmm8 # upper xx for sin term + # note using even reg + + movlpd p_temp3(%rsp),%xmm13 # lower r for cos term + + mulpd %xmm1,%xmm7 # x * xx for lower cos term + + mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term + + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos + + subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx + + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx + addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1+8(%rsp),%xmm9 # +xx + + movhlps %xmm1,%xmm1 # upper x for sin + addpd %xmm6,%xmm4 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + addsd %xmm1,%xmm9 # +x + addpd %xmm0,%xmm4 # +x + subsd 
%xmm13,%xmm5 # + t + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_cos_cleanup + + +.align 16 +.Lsinsin_cossin_piby4: # Derived from sincos_sinsin + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # x2 + movapd %xmm6,p_temp(%rsp) # xx + + movhlps %xmm10,%xmm10 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm7,%xmm11 # 0.5*x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zs + + + movsd %xmm2,%xmm13 + mulsd %xmm0,%xmm13 # low x3 for sin + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm2,%xmm2 # high x4 for cos + movsd %xmm13,%xmm2 # low x3 for sin + + + movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg + movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term + mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term + mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term + subsd %xmm12,%xmm10 # (1 + (-t)) - r + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin + subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx + + subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx + + subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx + + addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp(%rsp),%xmm4 # +xx + + addpd %xmm7,%xmm5 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + + addsd %xmm0,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + subsd %xmm12,%xmm8 # + t + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_cos_cleanup + +.align 16 +.Lsinsin_sincos_piby4: # Derived from sincos_coscos + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcossinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd 
.Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm7,%xmm11 # 0.5x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm0,%xmm2 # upper x3 for sin + mulsd %xmm0,%xmm2 # lower x4 for cos + + movhlps %xmm6,%xmm9 # upper xx for sin term + # note using even reg + + movlpd p_temp2(%rsp),%xmm12 # lower r for cos term + + mulpd %xmm0,%xmm6 # x * xx for lower cos term + + mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm4,%xmm8 # xmm9= sin, xmm5= cos + + subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + + subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx + addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp+8(%rsp),%xmm8 # +xx + + movhlps %xmm0,%xmm0 # upper x for sin + addpd %xmm7,%xmm5 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + + + addsd %xmm0,%xmm8 # +x + addpd %xmm1,%xmm5 # +x + subsd %xmm12,%xmm4 # + t + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_cos_cleanup + + +.align 16 +.Lsinsin_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,p_temp2(%rsp) # copy of x2 + movapd %xmm3,p_temp3(%rsp) # copy of x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm6,%xmm2 # 0.5 * x2 *xx + mulpd %xmm7,%xmm3 # 0.5 * x2 *xx + + mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd 
%xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + movapd p_temp2(%rsp),%xmm10 # x2 + movapd p_temp3(%rsp),%xmm11 # x2 + + mulpd %xmm0,%xmm10 # x3 + mulpd %xmm1,%xmm11 # x3 + + mulpd %xmm10,%xmm4 # x3 * zs + mulpd %xmm11,%xmm5 # x3 * zs + + subpd %xmm2,%xmm4 # -0.5 * x2 *xx + subpd %xmm3,%xmm5 # -0.5 * x2 *xx + + addpd %xmm6,%xmm4 # +xx + addpd %xmm7,%xmm5 # +xx + + addpd %xmm0,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + + jmp .L__vrd4_cos_cleanup
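
The .L*_piby4 branches above all evaluate the same two degree-6 polynomial cores; they differ only in which lane of each register carries the sin flavor and which the cos flavor, and in how the halves are shuffled back together. The scalar C sketch below models the math being vectorized; (x, xx) is the head/tail pair of the argument reduced to [-pi/4, pi/4], and s[]/c[] are illustrative stand-ins for the .Lsinarray/.Lcosarray coefficients, not names from the source.

    /* Scalar sketch of the two polynomial cores that every .L*_piby4
     * branch above vectorizes.  (x, xx) = head/tail of the argument
     * reduced to [-pi/4, pi/4]; s[6]/c[6] stand in for the
     * .Lsinarray/.Lcosarray coefficients. */
    static double sin_piby4_sketch(double x, double xx, const double s[6])
    {
        double x2 = x * x;
        double x6 = x2 * x2 * x2;
        double zs = (s[0] + x2 * (s[1] + x2 * s[2]))
                  + x6 * (s[3] + x2 * (s[4] + x2 * s[5]));
        /* sin(x + xx) ~= x + ((x^3*zs - 0.5*x2*xx) + xx) */
        return x + ((x2 * x * zs - 0.5 * x2 * xx) + xx);
    }

    static double cos_piby4_sketch(double x, double xx, const double c[6])
    {
        double x2 = x * x;
        double x6 = x2 * x2 * x2;
        double r  = 0.5 * x2;            /* the "r" of the comments above */
        double t  = 1.0 - r;
        double zc = (c[0] + x2 * (c[1] + x2 * c[2]))
                  + x6 * (c[3] + x2 * (c[4] + x2 * c[5]));
        /* cos(x + xx) ~= t + ((((1-t) - r) - x*xx) + x^4*zc) */
        return t + ((((1.0 - t) - r) - x * xx) + x2 * x2 * zc);
    }

The "(1-t) - r" dance in the cos core mirrors the code's trick of splitting the leading 1 - 0.5*x2: the large terms cancel exactly in t before the small corrections are folded in, which is what keeps the result inside 1 ulp.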
diff --git a/src/gas/vrdaexp.S b/src/gas/vrdaexp.S new file mode 100644 index 0000000..1ee640e --- /dev/null +++ b/src/gas/vrdaexp.S
@@ -0,0 +1,619 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrdaexp.asm +# +# An array implementation of the exp libm function. +# +# Prototype: +# +# void vrda_exp(int n, double *x, double *y); +# +# Computes e raised to the x power for an array of input values. +# Places the results into the supplied y array. +# Does not perform error checking. Denormal results are truncated to 0. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +# define local variable storage offsets +.equ p_temp,0 # temporary for get/put bits operation +.equ p_temp1,0x10 # temporary for exponent multiply + +.equ save_xa,0x020 #qword +.equ save_ya,0x028 #qword +.equ save_nv,0x030 #qword + +.equ p_iter,0x038 # qword storage for number of loop iterations + +.equ p2_temp,0x40 # second temporary for get/put bits operation + # large enough for two vectors +.equ p2_temp1,0x60 # second temporary for exponent multiply + # large enough for two vectors +.equ save_rbx,0x080 #qword + +.equ stack_size,0x088 + + .weak vrda_exp_ + .set vrda_exp_,__vrda_exp__ + .weak vrda_exp__ + .set vrda_exp__,__vrda_exp__ + + .text + .align 16 + .p2align 4,,15 + +#x/* a FORTRAN subroutine implementation of array exp +#** VRDA_EXP(N,X,Y) +# C equivalent*/ +#void vrda_exp__(int * n, double *x, double *y) +#{ +# vrda_exp(*n,x,y); +#} +.globl __vrda_exp__ + .type __vrda_exp__,@function +__vrda_exp__: + mov (%rdi),%edi + + + .align 16 + .p2align 4,,15 + + +# parameters are passed in by gcc as: +# edi - int n +# rsi - double *x +# rdx - double *y + + +.globl vrda_exp + .type vrda_exp,@function +vrda_exp: + + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) + +# save the arguments + mov %rsi,save_xa(%rsp) # save x_array pointer + mov %rdx,save_ya(%rsp) # save y_array pointer +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax + mov %rax,%rdi +#endif + + mov %rdi,save_nv(%rsp) # save number of values +# see if too few values to call the main loop + shr $2,%rax # get number of iterations + jz .L__vda_cleanup # jump if only single calls +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rdi # compute number of extra single calls + mov %rdi,save_nv(%rsp) # save number of left over values + +# In this second version, process the array 4 values at a time. + +.L__vda_top: +# build the input _m128d + movapd .L__real_thirtytwo_by_log2(%rip),%xmm3 # + mov save_xa(%rsp),%rsi # get x_array pointer + movlpd (%rsi),%xmm0 + movhpd 8(%rsi),%xmm0 + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + +# compute the exponents + +# Step 1. Reduce the argument. 
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */ +# r = x * thirtytwo_by_logbaseof2; + movapd %xmm3,%xmm7 + movapd %xmm0,p_temp(%rsp) + maxpd .L__real_C0F0000000000000(%rip),%xmm0 # protect against very large negative, non-infinite numbers + mulpd %xmm0,%xmm3 + + movlpd -16(%rsi),%xmm6 + movhpd -8(%rsi),%xmm6 + movapd %xmm6,p2_temp(%rsp) + maxpd .L__real_C0F0000000000000(%rip),%xmm6 + mulpd %xmm6,%xmm7 + +# save x for later. + minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers + +# /* Set n = nearest integer to r */ + cvtpd2dq %xmm3,%xmm4 + lea .L__two_to_jby32_lead_table(%rip),%rdi + lea .L__two_to_jby32_trail_table(%rip),%rsi + cvtdq2pd %xmm4,%xmm1 + minpd .L__real_40F0000000000000(%rip),%xmm7 # protect against very large, non-infinite numbers + + # r1 = x - n * logbaseof2_by_32_lead; + movapd .L__real_log2_by_32_lead(%rip),%xmm2 # + mulpd %xmm1,%xmm2 # + movq %xmm4,p_temp1(%rsp) + subpd %xmm2,%xmm0 # r1 in xmm0, + + cvtpd2dq %xmm7,%xmm2 + cvtdq2pd %xmm2,%xmm8 + +# r2 = - n * logbaseof2_by_32_trail; + mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1 +# j = n & 0x0000001f; + mov $0x01f,%r9 + mov %r9,%r8 + mov p_temp1(%rsp),%ecx + and %ecx,%r9d + movq %xmm2,p2_temp1(%rsp) + movapd .L__real_log2_by_32_lead(%rip),%xmm9 + mulpd %xmm8,%xmm9 + subpd %xmm9,%xmm6 # r1b in xmm6 + mulpd .L__real_log2_by_32_tail(%rip),%xmm8 # r2b in xmm8 + + mov p_temp1+4(%rsp),%edx + and %edx,%r8d +# f1 = two_to_jby32_lead_table[j]; +# f2 = two_to_jby32_trail_table[j]; + +# *m = (n - j) / 32; + sub %r9d,%ecx + sar $5,%ecx #m + sub %r8d,%edx + sar $5,%edx + + + movapd %xmm0,%xmm2 + addpd %xmm1,%xmm2 # r = r1 + r2 + + mov $0x01f,%r11 + mov %r11,%r10 + mov p2_temp1(%rsp),%ebx + and %ebx,%r11d +# Step 2. Compute the polynomial. 
+# q = r1 + (r2 + +# r*r*( 5.00000000000000008883e-01 + +# r*( 1.66666666665260878863e-01 + +# r*( 4.16666666662260795726e-02 + +# r*( 8.33336798434219616221e-03 + +# r*( 1.38889490863777199667e-03 )))))); +# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720 + movapd %xmm2,%xmm1 + movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720 + movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6 +# deal with infinite results + mov $1024,%rax + movsx %ecx,%rcx + cmp %rax,%rcx + + mulpd %xmm2,%xmm3 # *x + mulpd %xmm2,%xmm0 # *x + mulpd %xmm2,%xmm1 # x*x + movapd %xmm1,%xmm4 + + cmovg %rax,%rcx ## if infinite, then set rcx to multiply + # by infinity + movsx %edx,%rdx + cmp %rax,%rdx + + movapd %xmm6,%xmm9 + addpd %xmm8,%xmm9 # rb = r1b + r2b + addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120 + addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5 + mulpd %xmm1,%xmm4 # x^4 + mulpd %xmm2,%xmm3 # *x + + cmovg %rax,%rdx ## if infinite, then set rcx to multiply + # by infinity +# deal with denormal results + xor %rax,%rax + add $1023,%rcx # add bias + + mulpd %xmm1,%xmm0 # *x^2 + addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24 + addpd %xmm2,%xmm0 # + x + mulpd %xmm4,%xmm3 # *x^4 + +# check for infinity or nan + movapd p_temp(%rsp),%xmm2 + + cmovs %rax,%rcx ## if denormal, then multiply by 0 + shl $52,%rcx # build 2^n + + sub %r11d,%ebx + movapd %xmm9,%xmm1 + addpd %xmm3,%xmm0 # q = final sum + movapd .L__real_3f56c1728d739765(%rip),%xmm7 # 1/720 + movapd .L__real_3FC5555555548F7C(%rip),%xmm3 # 1/6 + +# *z2 = f2 + ((f1 + f2) * q); + movlpd (%rsi,%r9,8),%xmm5 # f2 + movlpd (%rsi,%r8,8),%xmm4 # f2 + addsd (%rdi,%r8,8),%xmm4 # f1 + f2 + addsd (%rdi,%r9,8),%xmm5 # f1 + f2 + mov p2_temp1+4(%rsp),%r8d + and %r8d,%r10d + sar $5,%ebx #m + mulpd %xmm9,%xmm7 # *x + mulpd %xmm9,%xmm3 # *x + mulpd %xmm9,%xmm1 # x*x + sub %r10d,%r8d + sar $5,%r8d +# check for infinity or nan + andpd .L__real_infinity(%rip),%xmm2 + cmppd $0,.L__real_infinity(%rip),%xmm2 + add $1023,%rdx # add bias + shufpd $0,%xmm4,%xmm5 + movapd %xmm1,%xmm4 + + cmovs %rax,%rdx ## if denormal, then multiply by 0 + shl $52,%rdx # build 2^n + + mulpd %xmm5,%xmm0 + mov %rcx,p_temp1(%rsp) # get 2^n to memory + mov %rdx,p_temp1+8(%rsp) # get 2^n to memory + addpd %xmm5,%xmm0 #z = z1 + z2 done with 1,2,3,4,5 + mov $1024,%rax + movsx %ebx,%rbx + cmp %rax,%rbx +# end of splitexp +# /* Scale (z1 + z2) by 2.0**m */ +# r = scaleDouble_1(z, n); + + + cmovg %rax,%rbx ## if infinite, then set rcx to multiply + # by infinity + movsx %r8d,%rdx + cmp %rax,%rdx + + movmskpd %xmm2,%r8d + + addpd .L__real_3F811115B7AA905E(%rip),%xmm7 # + 1/120 + addpd .L__real_3fe0000000000000(%rip),%xmm3 # + .5 + mulpd %xmm1,%xmm4 # x^4 + mulpd %xmm9,%xmm7 # *x + cmovg %rax,%rdx ## if infinite, then set rcx to multiply + + + xor %rax,%rax + add $1023,%rbx # add bias + + mulpd %xmm1,%xmm3 # *x^2 + addpd .L__real_3FA5555555545D4E(%rip),%xmm7 # + 1/24 + addpd %xmm9,%xmm3 # + x + mulpd %xmm4,%xmm7 # *x^4 + + cmovs %rax,%rbx ## if denormal, then multiply by 0 + shl $52,%rbx # build 2^n + +# Step 3. Reconstitute. 
+ + mulpd p_temp1(%rsp),%xmm0 # result *= 2^n + addpd %xmm7,%xmm3 # q = final sum + + movlpd (%rsi,%r11,8),%xmm5 # f2 + movlpd (%rsi,%r10,8),%xmm4 # f2 + addsd (%rdi,%r10,8),%xmm4 # f1 + f2 + addsd (%rdi,%r11,8),%xmm5 # f1 + f2 + + add $1023,%rdx # add bias + cmovs %rax,%rdx ## if denormal, then multiply by 0 + shufpd $0,%xmm4,%xmm5 + shl $52,%rdx # build 2^n + + mulpd %xmm5,%xmm3 + mov %rbx,p2_temp1(%rsp) # get 2^n to memory + mov %rdx,p2_temp1+8(%rsp) # get 2^n to memory + addpd %xmm5,%xmm3 #z = z1 + z2 + + movapd p2_temp(%rsp),%xmm2 + andpd .L__real_infinity(%rip),%xmm2 + cmppd $0,.L__real_infinity(%rip),%xmm2 + movmskpd %xmm2,%ebx + test $3,%r8d + mulpd p2_temp1(%rsp),%xmm3 # result *= 2^n +# we'd like to avoid a branch, and can use cmp's and and's to +# eliminate them. But it adds cycles for normal cases which +# are supposed to be exceptions. Using this branch with the +# check above results in faster code for the normal cases. + jnz .L__exp_naninf + +.L__vda_bottom1: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + test $3,%ebx + jnz .L__exp_naninf2 + +.L__vda_bottom2: + + prefetch 64(%rdi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movlpd %xmm3,-16(%rdi) + movhpd %xmm3,-8(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vda_top + + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vda_cleanup + + +# +# +.L__final_check: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + +# at least one of the numbers needs special treatment +.L__exp_naninf: + lea p_temp(%rsp),%rcx + call .L__naninf + jmp .L__vda_bottom1 +.L__exp_naninf2: + lea p2_temp(%rsp),%rcx + mov %ebx,%r8d + movapd %xmm3,%xmm0 + call .L__naninf + movapd %xmm0,%xmm3 + jmp .L__vda_bottom2 + +# This subroutine checks a double pair for nans and infinities and +# produces the proper result from the exceptional inputs +# Register assumptions: +# Inputs: +# r8d - mask of errors +# xmm0 - computed result vector +# rcx - pointing to memory image of inputs +# Outputs: +# xmm0 - new result vector +# %rax,rdx,,%xmm2 all modified. +.L__naninf: +# check the first number + test $1,%r8d + jz .L__check2 + + mov (%rcx),%rdx + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__enan1 # jump if mantissa not zero, so it's a NaN +# inf + mov %rdx,%rax + rcl $1,%rax + jnc .L__r1 # exp(+inf) = inf + xor %rdx,%rdx # exp(-inf) = 0 + jmp .L__r1 + +#NaN +.L__enan1: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__r1: + movd %rdx,%xmm2 + shufpd $2,%xmm0,%xmm2 + movsd %xmm2,%xmm0 +# check the second number +.L__check2: + test $2,%r8d + jz .L__r3 + mov 8(%rcx),%rdx + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__enan2 # jump if mantissa not zero, so it's a NaN +# inf + mov %rdx,%rax + rcl $1,%rax + jnc .L__r2 # exp(+inf) = inf + xor %rdx,%rdx # exp(-inf) = 0 + jmp .L__r2 + +#NaN +.L__enan2: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__r2: + movd %rdx,%xmm2 + shufpd $0,%xmm2,%xmm0 +.L__r3: + ret + + .align 16 +# we jump here when we have an odd number of exp calls to make at the +# end +# we assume that rdx is pointing at the next x array element, +# r8 at the next y array element. 
The number of values left is in +# save_nv +.L__vda_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__final_check # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi +# fill in a m128d with zeroes and the extra values and then make a recursive call. + xorpd %xmm0,%xmm0 + movlpd %xmm0,p2_temp+8(%rsp) + movapd %xmm0,p2_temp+16(%rsp) + + mov (%rsi),%rcx # we know there's at least one + mov %rcx,p2_temp(%rsp) + cmp $2,%rax + jl .L_vdacg + + mov 8(%rsi),%rcx # do the second value + mov %rcx,p2_temp+8(%rsp) + cmp $3,%rax + jl .L_vdacg + + mov 16(%rsi),%rcx # do the third value + mov %rcx,p2_temp+16(%rsp) + +.L_vdacg: + mov $4,%rdi # parameter for N + lea p2_temp(%rsp),%rsi # &x parameter + lea p2_temp1(%rsp),%rdx # &y parameter + call vrda_exp@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + mov p2_temp1(%rsp),%rcx + mov %rcx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L_vdacgf + + mov p2_temp1+8(%rsp),%rcx + mov %rcx,8(%rdi) # do the second value + cmp $3,%rax + jl .L_vdacgf + + mov p2_temp1+16(%rsp),%rcx + mov %rcx,16(%rdi) # do the third value + +.L_vdacgf: + jmp .L__final_check + + .data + .align 64 + + +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 # for alignment +.L__real_4040000000000000: .quad 0x04040000000000000 # 32 + .quad 0x04040000000000000 +.L__real_40F0000000000000: .quad 0x040F0000000000000 # 65536, to protect against really large numbers + .quad 0x040F0000000000000 +.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers + .quad 0x0C0F0000000000000 +.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32 + .quad 0x03FA0000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 +.L__real_infinity: .quad 0x07ff0000000000000 # + .quad 0x07ff0000000000000 # for alignment +.L__real_ninfinity: .quad 0x0fff0000000000000 # + .quad 0x0fff0000000000000 # for alignment +.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2 + .quad 0x040471547652b82fe +.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead + .quad 0x03f962e42fe000000 +.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail + .quad 0x0Bdcf473de6af278e +.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03 + .quad 0x03f56c1728d739765 +.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03 + .quad 0x03F811115B7AA905E +.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02 + .quad 0x03FA5555555545D4E +.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01 + .quad 0x03FC5555555548F7C + + +.L__two_to_jby32_lead_table: + .quad 0x03ff0000000000000 # 1 + .quad 0x03ff059b0d0000000 # 1.0219 + .quad 0x03ff0b55860000000 # 1.04427 + .quad 0x03ff11301d0000000 # 1.06714 + .quad 0x03ff172b830000000 # 1.09051 + .quad 0x03ff1d48730000000 # 1.11439 + .quad 0x03ff2387a60000000 # 1.13879 + .quad 0x03ff29e9df0000000 # 1.16372 + .quad 0x03ff306fe00000000 # 1.18921 + .quad 0x03ff371a730000000 # 1.21525 + .quad 0x03ff3dea640000000 # 1.24186 + .quad 0x03ff44e0860000000 # 1.26905 + .quad 0x03ff4bfdad0000000 # 1.29684 + .quad 0x03ff5342b50000000 # 1.32524 + .quad 0x03ff5ab07d0000000 # 1.35426 + .quad 0x03ff6247eb0000000 # 1.38391 + .quad 
0x03ff6a09e60000000 # 1.41421 + .quad 0x03ff71f75e0000000 # 1.44518 + .quad 0x03ff7a11470000000 # 1.47683 + .quad 0x03ff8258990000000 # 1.50916 + .quad 0x03ff8ace540000000 # 1.54221 + .quad 0x03ff93737b0000000 # 1.57598 + .quad 0x03ff9c49180000000 # 1.61049 + .quad 0x03ffa5503b0000000 # 1.64576 + .quad 0x03ffae89f90000000 # 1.68179 + .quad 0x03ffb7f76f0000000 # 1.71862 + .quad 0x03ffc199bd0000000 # 1.75625 + .quad 0x03ffcb720d0000000 # 1.79471 + .quad 0x03ffd5818d0000000 # 1.83401 + .quad 0x03ffdfc9730000000 # 1.87417 + .quad 0x03ffea4afa0000000 # 1.91521 + .quad 0x03fff507650000000 # 1.95714 + .quad 0 # for alignment +.L__two_to_jby32_trail_table: + .quad 0x00000000000000000 # 0 + .quad 0x03e48ac2ba1d73e2a # 1.1489e-008 + .quad 0x03e69f3121ec53172 # 4.83347e-008 + .quad 0x03df25b50a4ebbf1b # 2.67125e-010 + .quad 0x03e68faa2f5b9bef9 # 4.65271e-008 + .quad 0x03e368b9aa7805b80 # 5.24924e-009 + .quad 0x03e6ceac470cd83f6 # 5.38622e-008 + .quad 0x03e547f7b84b09745 # 1.90902e-008 + .quad 0x03e64636e2a5bd1ab # 3.79764e-008 + .quad 0x03e5ceaa72a9c5154 # 2.69307e-008 + .quad 0x03e682468446b6824 # 4.49684e-008 + .quad 0x03e18624b40c4dbd0 # 1.41933e-009 + .quad 0x03e54d8a89c750e5e # 1.94147e-008 + .quad 0x03e5a753e077c2a0f # 2.46409e-008 + .quad 0x03e6a90a852b19260 # 4.94813e-008 + .quad 0x03e0d2ac258f87d03 # 8.48872e-010 + .quad 0x03e59fcef32422cbf # 2.42032e-008 + .quad 0x03e61d8bee7ba46e2 # 3.3242e-008 + .quad 0x03e4f580c36bea881 # 1.45957e-008 + .quad 0x03e62999c25159f11 # 3.46453e-008 + .quad 0x03e415506dadd3e2a # 8.0709e-009 + .quad 0x03e29b8bc9e8a0388 # 2.99439e-009 + .quad 0x03e451f8480e3e236 # 9.83622e-009 + .quad 0x03e41f12ae45a1224 # 8.35492e-009 + .quad 0x03e62b5a75abd0e6a # 3.48493e-008 + .quad 0x03e47daf237553d84 # 1.11085e-008 + .quad 0x03e6b0aa538444196 # 5.03689e-008 + .quad 0x03e69df20d22a0798 # 4.81896e-008 + .quad 0x03e69f7490e4bb40b # 4.83654e-008 + .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008 + .quad 0x03e452486cc2c7b9d # 9.84533e-009 + .quad 0x03e66dc8a80ce9f09 # 4.25828e-008 + .quad 0 # for alignment +
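
The three numbered steps in vrda_exp amount, per array element, to the scalar model below. This is a sketch only: lead[]/trail[] stand for .L__two_to_jby32_lead_table/.L__two_to_jby32_trail_table, a single multiply-subtract stands in for the code's exact lead/tail reduction, and the NaN/infinity/denormal handling is omitted.

    #include <math.h>

    /* Scalar model of one vrda_exp element, Steps 1-3 only; no
     * NaN/infinity/denormal handling.  lead[]/trail[] stand in for
     * the 2^(j/32) lead/trail tables above. */
    static double exp_sketch(double x, const double lead[32],
                             const double trail[32])
    {
        /* Step 1: x = (32*m + j)*log(2)/32 + r, with |r| <= log(2)/64 */
        int    n = (int)lround(x * 32.0 / M_LN2);
        int    j = n & 0x1f;
        int    m = (n - j) >> 5;
        /* the assembly subtracts a lead/tail split of log(2)/32 so the
         * reduction is exact; one multiply-subtract suffices here */
        double r = x - n * (M_LN2 / 32.0);

        /* Step 2: q ~= exp(r) - 1, the same Taylor terms as above */
        double q = r + r * r * (0.5 + r * (1.0 / 6 + r * (1.0 / 24
                       + r * (1.0 / 120 + r * (1.0 / 720)))));

        /* Step 3: exp(x) = 2^m * (f1 + (f2 + (f1 + f2)*q)) */
        double f1 = lead[j], f2 = trail[j];
        return ldexp(f1 + (f2 + (f1 + f2) * q), m);
    }

Keeping 2^(j/32) as a lead/trail pair is the point of the two tables: the trail entry restores the bits a single rounded constant would lose, so the final recombination stays accurate to well under an ulp.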
diff --git a/src/gas/vrdalog.S b/src/gas/vrdalog.S new file mode 100644 index 0000000..cdbba18 --- /dev/null +++ b/src/gas/vrdalog.S
@@ -0,0 +1,954 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrdalog.s +# +# An array implementation of the log libm function. +# +# Prototype: +# +# void vrda_log(int n, double *x, double *y); +# +# Computes the natural log of x. +# Returns proper C99 values, but may not raise status flags properly. +# Less than 1 ulp of error. This version can compute logs in 44 +# cycles with n <= 24 +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +# define local variable storage offsets +.equ p_x,0 # temporary for error checking operation +.equ p_idx,0x010 # index storage +.equ p_xexp,0x020 # index storage + +.equ p_x2,0x030 # temporary for error checking operation +.equ p_idx2,0x040 # index storage +.equ p_xexp2,0x050 # index storage + +.equ save_xa,0x060 #qword +.equ save_ya,0x068 #qword +.equ save_nv,0x070 #qword +.equ p_iter,0x078 # qword storage for number of loop iterations + +.equ save_rbx,0x080 #qword + + +.equ p2_temp,0x090 # second temporary for get/put bits operation +.equ p2_temp1,0x0b0 # second temporary for exponent multiply + +.equ p_n1,0x0c0 # temporary for near one check +.equ p_n12,0x0d0 # temporary for near one check + + +.equ stack_size,0x0e8 + + .weak vrda_log_ + .set vrda_log_,__vrda_log__ + .weak vrda_log__ + .set vrda_log__,__vrda_log__ + +# parameters are passed in by Linux as: +# rdi - int n +# rsi - double *x +# rdx - double *y + + .text + .align 16 + .p2align 4,,15 + +#x/* a FORTRAN subroutine implementation of array log +#** VRDA_LOG(N,X,Y) +# C equivalent*/ +#void vrda_log__(int * n, double *x, double *y) +#{ +# vrda_log(*n,x,y); +#} +.globl __vrda_log__ + .type __vrda_log__,@function +__vrda_log__: + mov (%rdi),%edi + + .align 16 + .p2align 4,,15 +.globl vrda_log + .type vrda_log,@function +vrda_log: + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx + +# save the arguments + mov %rsi,save_xa(%rsp) # save x_array pointer + mov %rdx,save_ya(%rsp) # save y_array pointer +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax + mov %rax,%rdi +#endif + + mov %rdi,save_nv(%rsp) # save number of values +# see if too few values to call the main loop + shr $2,%rax # get number of iterations + jz .L__vda_cleanup # jump if only single calls +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rdi # compute number of extra single calls + mov %rdi,save_nv(%rsp) # save number of left over values + +# In this second version, process the array 2 values at a time. 
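
Per element, the main loop below performs the reduction and evaluation modeled by this scalar C sketch. Special inputs, the log2 lead/tail splits, and the near-one path are omitted; the sketch keeps the mantissa f in [1,2) where the code halves both f and f1 into [0.5,1), which yields the same u algebraically. ln_tbl[i] stands for the combined .L__np_ln_lead_table/.L__np_ln_tail_table entry ln((64+i)/64), and the 1/12, 1/80, 1/448 coefficients approximate the tuned .L__real_cb1..cb3 constants.

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Scalar model of vrda_log's main path for one element. */
    static double log_sketch(double x, const double ln_tbl[65])
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);

        int      xexp = (int)(bits >> 52) - 1023;   /* x = 2^xexp * f     */
        uint64_t mant = bits & 0x000fffffffffffffULL;
        /* round f to the nearest 1/64: index runs 64..128 */
        int      index = 64 + (int)(mant >> 46) + (int)((mant >> 45) & 1);

        double f  = 1.0 + (double)mant * 0x1p-52;   /* f in [1,2)          */
        double f1 = index / 64.0;                   /* nearest table point */
        double u  = 2.0 * (f - f1) / (f + f1);      /* ln(f/f1) variable   */
        double v  = u * u;
        double poly = u + u * v * (1.0 / 12 + v * (1.0 / 80 + v / 448));

        return xexp * M_LN2 + ln_tbl[index - 64] + poly;
    }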
+ +.L__vda_top: +# build the input _m128d + mov save_xa(%rsp),%rsi # get x_array pointer + movlpd (%rsi),%xmm0 + movhpd 8(%rsi),%xmm0 + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + movlpd -16(%rsi),%xmm7 + movhpd -8(%rsi),%xmm7 + +# compute the logs + +## if NaN or inf + movdqa %xmm0,p_x(%rsp) # save the input values + +# /* Store the exponent of x in xexp and put +# f into the range [0.5,1) */ + + pxor %xmm1,%xmm1 + movdqa %xmm0,%xmm3 + psrlq $52,%xmm3 + psubq .L__mask_1023(%rip),%xmm3 + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm6 # xexp + movdqa %xmm7,p_x2(%rsp) # save the input values + movdqa %xmm0,%xmm2 + subpd .L__real_one(%rip),%xmm2 + + movapd %xmm6,p_xexp(%rsp) + andpd .L__real_notsign(%rip),%xmm2 + xor %rax,%rax + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + + cmppd $1,.L__real_threshold(%rip),%xmm2 + movmskpd %xmm2,%ecx + movdqa %xmm3,%xmm4 + mov %ecx,p_n1(%rsp) + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrlq $45,%xmm3 + movdqa %xmm3,%xmm2 + psrlq $1,%xmm3 + paddq .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm2 + paddq %xmm2,%xmm3 + + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm1 + pxor %xmm7,%xmm7 + movdqa p_x2(%rsp),%xmm2 + movapd p_x2(%rsp),%xmm5 + psrlq $52,%xmm2 + psubq .L__mask_1023(%rip),%xmm2 + packssdw %xmm7,%xmm2 + subpd .L__real_one(%rip),%xmm5 + andpd .L__real_notsign(%rip),%xmm5 + cvtdq2pd %xmm2,%xmm6 # xexp + xor %rcx,%rcx + cmppd $1,.L__real_threshold(%rip),%xmm5 + movq %xmm3,p_idx(%rsp) + +# reduce and get u + por .L__real_half(%rip),%xmm4 + movdqa %xmm4,%xmm2 + movapd %xmm6,p_xexp2(%rsp) + + # do near one check + movmskpd %xmm5,%edx + mov %edx,p_n12(%rsp) + + mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128 + + + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx(%rsp),%eax + movdqa p_x2(%rsp),%xmm6 + + movapd .L__real_half(%rip),%xmm5 # .5 + subpd %xmm1,%xmm2 # f2 = f - f1 + pand .L__real_mant(%rip),%xmm6 + mulpd %xmm2,%xmm5 + addpd %xmm5,%xmm1 + + movdqa %xmm6,%xmm8 + psrlq $45,%xmm6 + movdqa %xmm6,%xmm4 + + psrlq $1,%xmm6 + paddq .L__mask_040(%rip),%xmm6 + pand .L__mask_001(%rip),%xmm4 + paddq %xmm4,%xmm6 +# do error checking here for scheduling. Saves a bunch of cycles as +# compared to doing this at the start of the routine. +## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + packssdw %xmm7,%xmm6 + por .L__real_half(%rip),%xmm8 + movq %xmm6,p_idx2(%rsp) + cvtdq2pd %xmm6,%xmm9 + + cmppd $2,.L__real_zero(%rip),%xmm0 + mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128 + movmskpd %xmm0,%r9d +# delaying this divide helps, but moving the other one does not. 
+# it was after the paddq + divpd %xmm1,%xmm2 # u + +# compute the index into the log tables +# + + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + mulpd %xmm3,%xmm2 # u5(B+Cu2) + + movapd p_xexp(%rsp),%xmm5 # xexp + addpd %xmm2,%xmm1 # poly +# recombine + mulpd %xmm5,%xmm4 # xexp * log2_lead + addpd %xmm4,%xmm0 #r1 + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%eax + mov p_idx2+4(%rsp),%ecx + addpd %xmm4,%xmm1 + + mulpd .L__real_log2_tail(%rip),%xmm5 + + movapd .L__real_half(%rip),%xmm4 # .5 + subpd %xmm9,%xmm8 # f2 = f - f1 + mulpd %xmm8,%xmm4 + addpd %xmm4,%xmm9 + + addpd %xmm5,%xmm1 #r2 + divpd %xmm9,%xmm8 # u + movapd p_x2(%rsp),%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r10d + movapd p_x2(%rsp),%xmm6 + cmppd $2,.L__real_zero(%rip),%xmm6 + movmskpd %xmm6,%r11d + +# check for nans/infs + test $3,%r8d + addpd %xmm1,%xmm0 + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + +.L__vlog2: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + + # It seems like a good idea to try and interleave + # even more of the following code sooner into the + # program. But there were conflicts with the table + # index registers, making the problem difficult. + # After a lot of work in a branch of this file, + # I was not able to match the speed of this version. + # CodeAnalyst shows that there is lots of unused add + # pipe time around the divides, but the processor + # doesn't seem to be able to schedule in those slots. 
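+
+# The arithmetic above follows the usual table-driven scheme; as an
+# illustrative C sketch (not from the original source; the names mirror
+# the .L__real_* constants), with x = 2^xexp * f, f1 a 1/128-spaced table
+# point near f, f2 = f - f1 and u = f2/(f1 + 0.5*f2):
+#
+#   poly  = u + CB1*u*u*u + (u*u*u*u*u)*(CB2 + CB3*u*u);
+#   r1    = z1 + xexp*LOG2_LEAD;        /* z1 = .L__np_ln_lead_table[idx] */
+#   r2    = poly + z2 + xexp*LOG2_TAIL; /* z2 = .L__np_ln_tail_table[idx] */
+#   ln(x) = r1 + r2;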
+ + movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q + +# check for near one + mov p_n1(%rsp),%r9d + test $3,%r9d + jnz .L__near_one1 +.L__vlog2n: + + # solve for ln(1+u) + movapd %xmm8,%xmm9 # u + mulpd %xmm8,%xmm8 # u^2 + movapd %xmm8,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm8,%xmm3 #Cu2 + mulpd %xmm9,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm8 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm9 # u+Au3 + mulpd %xmm3,%xmm8 # u5(B+Cu2) + + movapd p_xexp2(%rsp),%xmm5 # xexp + addpd %xmm8,%xmm9 # poly + # recombine + mulpd %xmm5,%xmm4 + addpd %xmm4,%xmm7 #r1 + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + addpd %xmm2,%xmm9 + + mulpd .L__real_log2_tail(%rip),%xmm5 + + addpd %xmm5,%xmm9 #r2 + + # check for nans/infs + test $3,%r10d + addpd %xmm9,%xmm7 + jnz .L__log_naninf2 +.L__vlog3: +# check for negative numbers or zero + test $3,%r11d + jnz .L__z_or_n2 + +.L__vlog4: + mov p_n12(%rsp),%r9d + test $3,%r9d + jnz .L__near_one2 + +.L__vlog4n: + + +#__vda_bottom2: + + prefetch 64(%rdi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movlpd %xmm7,-16(%rdi) + movhpd %xmm7,-8(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vda_top + + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vda_cleanup + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + .align 16 +.Lboth_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# return r + r2; + addpd %xmm2,%xmm0 + ret + + .align 16 +.L__near_one1: + cmp $3,%r9d + jnz .L__n1nb1 + + movapd p_x(%rsp),%xmm0 + call .Lboth_nearone + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + jmp .L__vlog2n + + .align 16 +.L__n1nb1: + test $1,%r9d + jz .L__lnn12 + + movlpd p_x(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,(%rdi) + +.L__lnn12: + test $2,%r9d # second number? + jz .L__lnn1e + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,8(%rdi) + +.L__lnn1e: + jmp .L__vlog2n + + + .align 16 +.L__near_one2: + cmp $3,%r9d + jnz .L__n1nb2 + + movapd p_x2(%rsp),%xmm0 + call .Lboth_nearone + movapd %xmm0,%xmm7 + jmp .L__vlog4n + + .align 16 +.L__n1nb2: + test $1,%r9d + jz .L__lnn22 + + movlpd p_x2(%rsp),%xmm0 + call .L__ln1 + movsd %xmm0,%xmm7 + +.L__lnn22: + test $2,%r9d # second number? 
+ jz .L__lnn2e + movlpd p_x2+8(%rsp),%xmm0 + call .L__ln1 + movlhps %xmm0,%xmm7 + +.L__lnn2e: + jmp .L__vlog4n + + .align 16 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + +# return r + r2; + addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe: + jmp .L__vlog1 # continue processing if not + +# at least one of the numbers was a nan or infinity +.L__log_naninf2: + test $1,%r10d # first number? + jz .L__lninf22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm7,%xmm1 # save the inputs + mov p_x2(%rsp),%rdx + movlpd p_x2(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm7,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + movapd %xmm0,%xmm7 + +.L__lninf22: + test $2,%r10d # second number? + jz .L__lninfe2 + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rdx + movlpd p_x2+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe2: + jmp .L__vlog3 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn2: + test $2,%r9d # second number? 
+ jz .L__zne + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne: + jmp .L__vlog2 + +.L__z_or_n2: + test $1,%r11d # first number? + jz .L__zn22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2(%rsp),%rax + call .L__zni + shufpd $2,%xmm7,%xmm0 + movapd %xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn22: + test $2,%r11d # second number? + jz .L__zne2 + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne2: + jmp .L__vlog4 +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x # if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + +# we jump here when we have an odd number of log calls to make at the +# end +# we assume that rdx is pointing at the next x array element, +# r8 at the next y array element. The number of values left is in +# save_nv +.L__vda_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__finish # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + +# fill in a m128d with zeroes and the extra values and then make a recursive call. + xorpd %xmm0,%xmm0 + movlpd %xmm0,p_x+8(%rsp) + movapd %xmm0,p_x+16(%rsp) + + mov (%rsi),%rcx # we know there's at least one + mov %rcx,p_x(%rsp) + cmp $2,%rax + jl .L__vdacg + + mov 8(%rsi),%rcx # do the second value + mov %rcx,p_x+8(%rsp) + cmp $3,%rax + jl .L__vdacg + + mov 16(%rsi),%rcx # do the third value + mov %rcx,p_x+16(%rsp) + +.L__vdacg: + mov $4,%rdi # parameter for N + lea p_x(%rsp),%rsi # &x parameter + lea p2_temp(%rsp),%rdx # &y parameter + call vrda_log@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + mov p2_temp(%rsp),%rcx + mov %rcx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vdacgf + + mov p2_temp+8(%rsp),%rcx + mov %rcx,8(%rdi) # do the second value + cmp $3,%rax + jl .L__vdacgf + + mov p2_temp+16(%rsp),%rcx + mov %rcx,16(%rdi) # do the third value + +.L__vdacgf: + jmp .L__finish + + .data + .align 64 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_two: .quad 0x04000000000000000 # 2.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03F9EB85000000000 # .03 + .quad 0x03F9EB85000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 
0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 0x00000000000000001 # + .quad 0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 + + +.L__np_ln_lead_table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01 + .quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01 + .quad 0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + 
.quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01 + .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08 + .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08 + .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 
5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08 + .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment +
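For reference, the routine above can be driven from C along these lines. This is a minimal, illustrative caller (the harness below is ours, not part of the patch), assuming only the prototype declared in the file header:

    #include <stdio.h>

    /* provided by libacml_mv; see the prototype in the file header */
    extern void vrda_log(int n, double *x, double *y);

    int main(void)
    {
        double x[5] = { 0.5, 1.0, 2.0, 10.0, 100.0 };
        double y[5];

        vrda_log(5, x, y);              /* y[i] = ln(x[i]) */

        for (int i = 0; i < 5; i++)
            printf("ln(%g) = %.17g\n", x[i], y[i]);
        return 0;
    }

With n = 5, one four-wide pass runs through the main loop and the single leftover value is handled by the .L__vda_cleanup path.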
diff --git a/src/gas/vrdalog10.S b/src/gas/vrdalog10.S new file mode 100644 index 0000000..f766b62 --- /dev/null +++ b/src/gas/vrdalog10.S
@@ -0,0 +1,1021 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalog10.s
+#
+# An array implementation of the log10 libm function.
+#
+# Prototype:
+#
+# void vrda_log10(int n, double *x, double *y);
+#
+# Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute log10s in 50-55
+# cycles with n <= 24
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrda_log10_
+ .set vrda_log10_,__vrda_log10__
+ .weak vrda_log10__
+ .set vrda_log10__,__vrda_log10__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#x/* a FORTRAN subroutine implementation of array log10
+#** VRDA_LOG10(N,X,Y)
+# C equivalent*/
+#void vrda_log10__(int * n, double *x, double *y)
+#{
+# vrda_log10(*n,x,y);
+#}
+.globl __vrda_log10__
+ .type __vrda_log10__,@function
+__vrda_log10__:
+ mov (%rdi),%edi
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl vrda_log10
+ .type vrda_log10,@function
+vrda_log10:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+ +.L__vda_top: +# build the input _m128d + mov save_xa(%rsp),%rsi # get x_array pointer + movlpd (%rsi),%xmm0 + movhpd 8(%rsi),%xmm0 + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + movlpd -16(%rsi),%xmm7 + movhpd -8(%rsi),%xmm7 + +# compute the log10s + +## if NaN or inf + movdqa %xmm0,p_x(%rsp) # save the input values + +# /* Store the exponent of x in xexp and put +# f into the range [0.5,1) */ + + pxor %xmm1,%xmm1 + movdqa %xmm0,%xmm3 + psrlq $52,%xmm3 + psubq .L__mask_1023(%rip),%xmm3 + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm6 # xexp + movdqa %xmm7,p_x2(%rsp) # save the input values + movdqa %xmm0,%xmm2 + subpd .L__real_one(%rip),%xmm2 + + movapd %xmm6,p_xexp(%rsp) + andpd .L__real_notsign(%rip),%xmm2 + xor %rax,%rax + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + + cmppd $1,.L__real_threshold(%rip),%xmm2 + movmskpd %xmm2,%ecx + movdqa %xmm3,%xmm4 + mov %ecx,p_n1(%rsp) + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrlq $45,%xmm3 + movdqa %xmm3,%xmm2 + psrlq $1,%xmm3 + paddq .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm2 + paddq %xmm2,%xmm3 + + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm1 + pxor %xmm7,%xmm7 + movdqa p_x2(%rsp),%xmm2 + movapd p_x2(%rsp),%xmm5 + psrlq $52,%xmm2 + psubq .L__mask_1023(%rip),%xmm2 + packssdw %xmm7,%xmm2 + subpd .L__real_one(%rip),%xmm5 + andpd .L__real_notsign(%rip),%xmm5 + cvtdq2pd %xmm2,%xmm6 # xexp + xor %rcx,%rcx + cmppd $1,.L__real_threshold(%rip),%xmm5 + movq %xmm3,p_idx(%rsp) + +# reduce and get u + por .L__real_half(%rip),%xmm4 + movdqa %xmm4,%xmm2 + movapd %xmm6,p_xexp2(%rsp) + + # do near one check + movmskpd %xmm5,%edx + mov %edx,p_n12(%rsp) + + mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128 + + + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx(%rsp),%eax + movdqa p_x2(%rsp),%xmm6 + + movapd .L__real_half(%rip),%xmm5 # .5 + subpd %xmm1,%xmm2 # f2 = f - f1 + pand .L__real_mant(%rip),%xmm6 + mulpd %xmm2,%xmm5 + addpd %xmm5,%xmm1 + + movdqa %xmm6,%xmm8 + psrlq $45,%xmm6 + movdqa %xmm6,%xmm4 + + psrlq $1,%xmm6 + paddq .L__mask_040(%rip),%xmm6 + pand .L__mask_001(%rip),%xmm4 + paddq %xmm4,%xmm6 +# do error checking here for scheduling. Saves a bunch of cycles as +# compared to doing this at the start of the routine. +## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + packssdw %xmm7,%xmm6 + por .L__real_half(%rip),%xmm8 + movq %xmm6,p_idx2(%rsp) + cvtdq2pd %xmm6,%xmm9 + + cmppd $2,.L__real_zero(%rip),%xmm0 + mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128 + movmskpd %xmm0,%r9d +# delaying this divide helps, but moving the other one does not. 
+# it was after the paddq + divpd %xmm1,%xmm2 # u + +# compute the index into the log10 tables +# + + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + mulpd %xmm3,%xmm2 # u5(B+Cu2) + + movapd p_xexp(%rsp),%xmm5 # xexp + addpd %xmm2,%xmm1 # poly +# recombine + mulpd %xmm5,%xmm4 # xexp * log2_lead + addpd %xmm4,%xmm0 #r1 + movapd %xmm0,%xmm2 #for log10 + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q + mulpd .L__real_log10e_tail(%rip),%xmm0 #for log10 + mulpd .L__real_log10e_lead(%rip),%xmm2 #for log10 + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%eax + mov p_idx2+4(%rsp),%ecx + addpd %xmm4,%xmm1 + + mulpd .L__real_log2_tail(%rip),%xmm5 + + movapd .L__real_half(%rip),%xmm4 # .5 + subpd %xmm9,%xmm8 # f2 = f - f1 + mulpd %xmm8,%xmm4 + addpd %xmm4,%xmm9 + + addpd %xmm5,%xmm1 #r2 + movapd %xmm1,%xmm7 #for log10 + mulpd .L__real_log10e_tail(%rip),%xmm1 #for log10 + addpd %xmm1,%xmm0 #for log10 + + divpd %xmm9,%xmm8 # u + movapd p_x2(%rsp),%xmm3 + mulpd .L__real_log10e_lead(%rip),%xmm7 #log10 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r10d + addpd %xmm7,%xmm0 #for log10 + movapd p_x2(%rsp),%xmm6 + cmppd $2,.L__real_zero(%rip),%xmm6 + movmskpd %xmm6,%r11d + +# check for nans/infs + test $3,%r8d + addpd %xmm2,%xmm0 #for log10 +# addpd %xmm1,%xmm0 + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + +.L__vlog2: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + + # It seems like a good idea to try and interleave + # even more of the following code sooner into the + # program. But there were conflicts with the table + # index registers, making the problem difficult. + # After a lot of work in a branch of this file, + # I was not able to match the speed of this version. + # CodeAnalyst shows that there is lots of unused add + # pipe time around the divides, but the processor + # doesn't seem to be able to schedule in those slots. 
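+
+# A C sketch of the "for log10" rescaling interleaved above (illustrative
+# only; the names are ours).  r1 and r2 are the lead/tail parts of ln(x),
+# and log10(e) is split as LOG10E_LEAD + LOG10E_TAIL:
+#
+#   log10(x) = ((r1*LOG10E_TAIL + r2*LOG10E_TAIL) + r2*LOG10E_LEAD)
+#            + r1*LOG10E_LEAD;
+#
+# The small tail products are accumulated before the large lead terms are
+# added, which preserves the extra precision of the split constants.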
+ + movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q + +# check for near one + mov p_n1(%rsp),%r9d + test $3,%r9d + jnz .L__near_one1 +.L__vlog2n: + + # solve for ln(1+u) + movapd %xmm8,%xmm9 # u + mulpd %xmm8,%xmm8 # u^2 + movapd %xmm8,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm8,%xmm3 #Cu2 + mulpd %xmm9,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm8 # u^5 + movapd .L__real_log2_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm9 # u+Au3 + mulpd %xmm3,%xmm8 # u5(B+Cu2) + + movapd p_xexp2(%rsp),%xmm5 # xexp + addpd %xmm8,%xmm9 # poly + # recombine + mulpd %xmm5,%xmm4 + addpd %xmm4,%xmm7 #r1 + movapd %xmm7,%xmm6 #for log10 + + lea .L__np_ln_tail_table(%rip),%rdx + mulpd .L__real_log10e_tail(%rip),%xmm7 #for log10 + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + mulpd .L__real_log10e_lead(%rip),%xmm6 #for log10 + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + addpd %xmm2,%xmm9 + + mulpd .L__real_log2_tail(%rip),%xmm5 + + addpd %xmm5,%xmm9 #r2 + movapd %xmm9,%xmm8 #for log10 + mulpd .L__real_log10e_tail(%rip),%xmm9 #for log 10 + addpd %xmm9,%xmm7 #for log10 + mulpd .L__real_log10e_lead(%rip),%xmm8 #for log10 + addpd %xmm8,%xmm7 #for log10 + + # check for nans/infs + test $3,%r10d + addpd %xmm6,%xmm7 #for log10 +# addpd %xmm9,%xmm7 + jnz .L__log_naninf2 +.L__vlog3: +# check for negative numbers or zero + test $3,%r11d + jnz .L__z_or_n2 + +.L__vlog4: + mov p_n12(%rsp),%r9d + test $3,%r9d + jnz .L__near_one2 + +.L__vlog4n: + + +#__vda_bottom2: + + prefetch 64(%rdi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movlpd %xmm7,-16(%rdi) + movhpd %xmm7,-8(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vda_top + + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vda_cleanup + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + .align 16 +.Lboth_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# loge to log10 + movapd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subpd %xmm3,%xmm0 + addpd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movapd %xmm3,%xmm0 + movapd %xmm2,%xmm1 + + mulpd .L__real_log10e_tail(%rip),%xmm2 + mulpd .L__real_log10e_tail(%rip),%xmm0 + mulpd .L__real_log10e_lead(%rip),%xmm1 + mulpd .L__real_log10e_lead(%rip),%xmm3 + addpd %xmm2,%xmm0 + addpd %xmm1,%xmm0 + addpd %xmm3,%xmm0 + +# return r + r2; +# addpd %xmm2,%xmm0 + ret + + .align 16 +.L__near_one1: + cmp $3,%r9d + jnz .L__n1nb1 + + movapd p_x(%rsp),%xmm0 + call .Lboth_nearone + movlpd %xmm0,(%rdi) + 
movhpd %xmm0,8(%rdi) + jmp .L__vlog2n + + .align 16 +.L__n1nb1: + test $1,%r9d + jz .L__lnn12 + + movlpd p_x(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,(%rdi) + +.L__lnn12: + test $2,%r9d # second number? + jz .L__lnn1e + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,8(%rdi) + +.L__lnn1e: + jmp .L__vlog2n + + + .align 16 +.L__near_one2: + cmp $3,%r9d + jnz .L__n1nb2 + + movapd p_x2(%rsp),%xmm0 + call .Lboth_nearone + movapd %xmm0,%xmm7 + jmp .L__vlog4n + + .align 16 +.L__n1nb2: + test $1,%r9d + jz .L__lnn22 + + movlpd p_x2(%rsp),%xmm0 + call .L__ln1 + movsd %xmm0,%xmm7 + +.L__lnn22: + test $2,%r9d # second number? + jz .L__lnn2e + movlpd p_x2+8(%rsp),%xmm0 + call .L__ln1 + movlhps %xmm0,%xmm7 + +.L__lnn2e: + jmp .L__vlog4n + + .align 16 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + +# loge to log10 + movsd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subsd %xmm3,%xmm0 + addsd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movsd %xmm3,%xmm0 + movsd %xmm2,%xmm1 + + mulsd .L__real_log10e_tail(%rip),%xmm2 + mulsd .L__real_log10e_tail(%rip),%xmm0 + mulsd .L__real_log10e_lead(%rip),%xmm1 + mulsd .L__real_log10e_lead(%rip),%xmm3 + addsd %xmm2,%xmm0 + addsd %xmm1,%xmm0 + addsd %xmm3,%xmm0 + +# return r + r2; +# addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe: + jmp .L__vlog1 # continue processing if not + +# at least one of the numbers was a nan or infinity +.L__log_naninf2: + test $1,%r10d # first number? + jz .L__lninf22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm7,%xmm1 # save the inputs + mov p_x2(%rsp),%rdx + movlpd p_x2(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm7,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + movapd %xmm0,%xmm7 + +.L__lninf22: + test $2,%r10d # second number? 
+ jz .L__lninfe2 + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rdx + movlpd p_x2+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe2: + jmp .L__vlog3 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn2: + test $2,%r9d # second number? + jz .L__zne + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne: + jmp .L__vlog2 + +.L__z_or_n2: + test $1,%r11d # first number? + jz .L__zn22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2(%rsp),%rax + call .L__zni + shufpd $2,%xmm7,%xmm0 + movapd %xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn22: + test $2,%r11d # second number? + jz .L__zne2 + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne2: + jmp .L__vlog4 +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x # if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + +# we jump here when we have an odd number of log calls to make at the +# end +# we assume that rdx is pointing at the next x array element, +# r8 at the next y array element. The number of values left is in +# save_nv +.L__vda_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__finish # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + +# fill in a m128d with zeroes and the extra values and then make a recursive call. 
+ xorpd %xmm0,%xmm0 + movlpd %xmm0,p_x+8(%rsp) + movapd %xmm0,p_x+16(%rsp) + + mov (%rsi),%rcx # we know there's at least one + mov %rcx,p_x(%rsp) + cmp $2,%rax + jl .L__vdacg + + mov 8(%rsi),%rcx # do the second value + mov %rcx,p_x+8(%rsp) + cmp $3,%rax + jl .L__vdacg + + mov 16(%rsi),%rcx # do the third value + mov %rcx,p_x+16(%rsp) + +.L__vdacg: + mov $4,%rdi # parameter for N + lea p_x(%rsp),%rsi # &x parameter + lea p2_temp(%rsp),%rdx # &y parameter + call vrda_log10@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + mov p2_temp(%rsp),%rcx + mov %rcx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vdacgf + + mov p2_temp+8(%rsp),%rcx + mov %rcx,8(%rdi) # do the second value + cmp $3,%rax + jl .L__vdacgf + + mov p2_temp+16(%rsp),%rcx + mov %rcx,16(%rdi) # do the third value + +.L__vdacgf: + jmp .L__finish + + .data + .align 64 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_two: .quad 0x04000000000000000 # 2.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold + .quad 0x03FB082C000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 0x00000000000000001 # + .quad 0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 + +.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01 + .quad 0x03fdbcb7800000000 +.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7 + .quad 0x03ea8a93728719535 + +.L__mask_lower: .quad 0x0ffffffff00000000 + .quad 0x0ffffffff00000000 + +.L__np_ln_lead_table: + .quad 
0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01 + .quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01 + .quad 0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 
6.85303986072540283203e-01 + .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08 + .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08 + .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 
3.29990737637586136511e-08 + .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment +
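The near-one paths above (.Lboth_nearone and .L__ln1) convert ln to log10 with one extra trick: r is split at the 32-bit boundary of its bit pattern so that the leading product picks up essentially no rounding error. A hedged C sketch of that step (the helper and constant names are ours; the constant values come from the data section above):

    #include <stdint.h>
    #include <string.h>

    #define LOG10E_LEAD 4.34293746948242187500e-01   /* .L__real_log10e_lead */
    #define LOG10E_TAIL 7.3495500964015109100644e-7  /* .L__real_log10e_tail */

    /* keep only the upper 32 bits of the double's bit pattern,
       mirroring the pand with .L__mask_lower (0xffffffff00000000) */
    static double clear_low32(double r)
    {
        uint64_t bits;
        memcpy(&bits, &r, sizeof bits);
        bits &= 0xffffffff00000000ull;
        memcpy(&r, &bits, sizeof bits);
        return r;
    }

    /* r = x - 1 and r2 = the polynomial correction computed earlier */
    static double near_one_log10(double r, double r2)
    {
        double r1 = clear_low32(r);  /* r1*LOG10E_LEAD is effectively exact */
        r2 = r2 + (r - r1);          /* fold the discarded bits into the tail */
        return ((r1*LOG10E_TAIL + r2*LOG10E_TAIL) + r2*LOG10E_LEAD)
               + r1*LOG10E_LEAD;
    }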
diff --git a/src/gas/vrdalog2.S b/src/gas/vrdalog2.S new file mode 100644 index 0000000..0200f03 --- /dev/null +++ b/src/gas/vrdalog2.S
@@ -0,0 +1,1003 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalog2.s
+#
+# An array implementation of the log2 libm function.
+#
+# Prototype:
+#
+# void vrda_log2(int n, double *x, double *y);
+#
+# Computes the base-2 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute base-2 logs in 44
+# cycles with n <= 24
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+ .weak vrda_log2_
+ .set vrda_log2_,__vrda_log2__
+ .weak vrda_log2__
+ .set vrda_log2__,__vrda_log2__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#x/* a FORTRAN subroutine implementation of array log2
+#** VRDA_LOG2(N,X,Y)
+# C equivalent*/
+#void vrda_log2__(int * n, double *x, double *y)
+#{
+# vrda_log2(*n,x,y);
+#}
+.globl __vrda_log2__
+ .type __vrda_log2__,@function
+__vrda_log2__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_log2
+ .type vrda_log2,@function
+vrda_log2:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+ +.L__vda_top: +# build the input _m128d + mov save_xa(%rsp),%rsi # get x_array pointer + movlpd (%rsi),%xmm0 + movhpd 8(%rsi),%xmm0 + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + movlpd -16(%rsi),%xmm7 + movhpd -8(%rsi),%xmm7 + +# compute the logs + +## if NaN or inf + movdqa %xmm0,p_x(%rsp) # save the input values + +# /* Store the exponent of x in xexp and put +# f into the range [0.5,1) */ + + pxor %xmm1,%xmm1 + movdqa %xmm0,%xmm3 + psrlq $52,%xmm3 + psubq .L__mask_1023(%rip),%xmm3 + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm6 # xexp + movdqa %xmm7,p_x2(%rsp) # save the input values + movdqa %xmm0,%xmm2 + subpd .L__real_one(%rip),%xmm2 + + movapd %xmm6,p_xexp(%rsp) + andpd .L__real_notsign(%rip),%xmm2 + xor %rax,%rax + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + + cmppd $1,.L__real_threshold(%rip),%xmm2 + movmskpd %xmm2,%ecx + movdqa %xmm3,%xmm4 + mov %ecx,p_n1(%rsp) + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrlq $45,%xmm3 + movdqa %xmm3,%xmm2 + psrlq $1,%xmm3 + paddq .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm2 + paddq %xmm2,%xmm3 + + packssdw %xmm1,%xmm3 + cvtdq2pd %xmm3,%xmm1 + pxor %xmm7,%xmm7 + movdqa p_x2(%rsp),%xmm2 + movapd p_x2(%rsp),%xmm5 + psrlq $52,%xmm2 + psubq .L__mask_1023(%rip),%xmm2 + packssdw %xmm7,%xmm2 + subpd .L__real_one(%rip),%xmm5 + andpd .L__real_notsign(%rip),%xmm5 + cvtdq2pd %xmm2,%xmm6 # xexp + xor %rcx,%rcx + cmppd $1,.L__real_threshold(%rip),%xmm5 + movq %xmm3,p_idx(%rsp) + +# reduce and get u + por .L__real_half(%rip),%xmm4 + movdqa %xmm4,%xmm2 + movapd %xmm6,p_xexp2(%rsp) + + # do near one check + movmskpd %xmm5,%edx + mov %edx,p_n12(%rsp) + + mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128 + + + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx(%rsp),%eax + movdqa p_x2(%rsp),%xmm6 + + movapd .L__real_half(%rip),%xmm5 # .5 + subpd %xmm1,%xmm2 # f2 = f - f1 + pand .L__real_mant(%rip),%xmm6 + mulpd %xmm2,%xmm5 + addpd %xmm5,%xmm1 + + movdqa %xmm6,%xmm8 + psrlq $45,%xmm6 + movdqa %xmm6,%xmm4 + + psrlq $1,%xmm6 + paddq .L__mask_040(%rip),%xmm6 + pand .L__mask_001(%rip),%xmm4 + paddq %xmm4,%xmm6 +# do error checking here for scheduling. Saves a bunch of cycles as +# compared to doing this at the start of the routine. +## if NaN or inf + movapd %xmm0,%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r8d + packssdw %xmm7,%xmm6 + por .L__real_half(%rip),%xmm8 + movq %xmm6,p_idx2(%rsp) + cvtdq2pd %xmm6,%xmm9 + + cmppd $2,.L__real_zero(%rip),%xmm0 + mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128 + movmskpd %xmm0,%r9d +# delaying this divide helps, but moving the other one does not. 
+# it was after the paddq + divpd %xmm1,%xmm2 # u + +# compute the index into the log tables +# + + movlpd -512(%rdx,%rax,8),%xmm0 # z1 + mov p_idx+4(%rsp),%ecx + movhpd -512(%rdx,%rcx,8),%xmm0 # z1 +# solve for ln(1+u) + movapd %xmm2,%xmm1 # u + mulpd %xmm2,%xmm2 # u^2 + movapd %xmm2,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm2,%xmm3 #Cu2 + mulpd %xmm1,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm2 # u^5 + movapd .L__real_log2e_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm1 # u+Au3 + movapd %xmm0,%xmm5 #z1 copy + mulpd %xmm3,%xmm2 # u5(B+Cu2) + movapd .L__real_log2e_tail(%rip),%xmm3 + movapd p_xexp(%rsp),%xmm6 # xexp + addpd %xmm2,%xmm1 # poly +# recombine + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%eax + mov p_idx2+4(%rsp),%ecx + addpd %xmm2,%xmm1 #z2 + movapd %xmm1,%xmm2 #z2 copy + + + mulpd %xmm4,%xmm5 + mulpd %xmm4,%xmm1 + movapd .L__real_half(%rip),%xmm4 # .5 + subpd %xmm9,%xmm8 # f2 = f - f1 + mulpd %xmm8,%xmm4 + addpd %xmm4,%xmm9 + mulpd %xmm3,%xmm2 #z2*log2e_tail + mulpd %xmm3,%xmm0 #z1*log2e_tail + addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp + addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail + addpd %xmm1,%xmm0 #r2 + + divpd %xmm9,%xmm8 # u + movapd p_x2(%rsp),%xmm3 + andpd .L__real_inf(%rip),%xmm3 + cmppd $0,.L__real_inf(%rip),%xmm3 + movmskpd %xmm3,%r10d + movapd p_x2(%rsp),%xmm6 + cmppd $2,.L__real_zero(%rip),%xmm6 + movmskpd %xmm6,%r11d + +# check for nans/infs + test $3,%r8d + addpd %xmm5,%xmm0 + jnz .L__log_naninf +.L__vlog1: +# check for negative numbers or zero + test $3,%r9d + jnz .L__z_or_n + +.L__vlog2: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + + # It seems like a good idea to try and interleave + # even more of the following code sooner into the + # program. But there were conflicts with the table + # index registers, making the problem difficult. + # After a lot of work in a branch of this file, + # I was not able to match the speed of this version. + # CodeAnalyst shows that there is lots of unused add + # pipe time around the divides, but the processor + # doesn't seem to be able to schedule in those slots. 
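+
+# A C sketch of the log2 recombination above (illustrative only; the
+# names are ours).  z1 is the lead table value, z2 = poly + tail table
+# value, and log2(e) is split as LOG2E_LEAD + LOG2E_TAIL.  Unlike the ln
+# and log10 variants, xexp needs no scaling here: it is already exact in
+# base 2.
+#
+#   r1 = xexp + z1*LOG2E_LEAD;
+#   r2 = (z1*LOG2E_TAIL + z2*LOG2E_TAIL) + z2*LOG2E_LEAD;
+#   log2(x) = r1 + r2;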
+ + movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q + +# check for near one + mov p_n1(%rsp),%r9d + test $3,%r9d + jnz .L__near_one1 +.L__vlog2n: + + # solve for ln(1+u) + movapd %xmm8,%xmm9 # u + mulpd %xmm8,%xmm8 # u^2 + movapd %xmm8,%xmm5 + movapd .L__real_cb3(%rip),%xmm3 + mulpd %xmm8,%xmm3 #Cu2 + mulpd %xmm9,%xmm5 # u^3 + addpd .L__real_cb2(%rip),%xmm3 #B+Cu2 + + mulpd %xmm5,%xmm8 # u^5 + movapd .L__real_log2e_lead(%rip),%xmm4 + + mulpd .L__real_cb1(%rip),%xmm5 #Au3 + addpd %xmm5,%xmm9 # u+Au3 + movapd %xmm7,%xmm5 #z1 copy + mulpd %xmm3,%xmm8 # u5(B+Cu2) + movapd .L__real_log2e_tail(%rip),%xmm3 + movapd p_xexp2(%rsp),%xmm6 # xexp + addpd %xmm8,%xmm9 # poly + # recombine + lea .L__np_ln_tail_table(%rip),%rdx + movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q + movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q + addpd %xmm2,%xmm9 #z2 + movapd %xmm9,%xmm2 #z2 copy + + mulpd %xmm4,%xmm5 #z1*log2e_lead + mulpd %xmm4,%xmm9 #z2*log2e_lead + mulpd %xmm3,%xmm2 #z2*log2e_tail + mulpd %xmm3,%xmm7 #z1*log2e_tail + addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp + addpd %xmm2,%xmm7 #z1*log2e_tail + z2*log2e_tail + + + addpd %xmm9,%xmm7 #r2 + + # check for nans/infs + test $3,%r10d + addpd %xmm5,%xmm7 + jnz .L__log_naninf2 +.L__vlog3: +# check for negative numbers or zero + test $3,%r11d + jnz .L__z_or_n2 + +.L__vlog4: + mov p_n12(%rsp),%r9d + test $3,%r9d + jnz .L__near_one2 + +.L__vlog4n: + + +#__vda_bottom2: + + prefetch 64(%rdi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movlpd %xmm7,-16(%rdi) + movhpd %xmm7,-8(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vda_top + + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vda_cleanup + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + .align 16 +.Lboth_nearone: +# saves 10 cycles +# r = x - 1.0; + movapd .L__real_two(%rip),%xmm2 + subpd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addpd %xmm0,%xmm2 + movapd %xmm0,%xmm1 + divpd %xmm2,%xmm1 # u + movapd .L__real_ca4(%rip),%xmm4 #D + movapd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movapd %xmm0,%xmm6 + mulpd %xmm1,%xmm6 # correction +# u = u + u; + addpd %xmm1,%xmm1 #u + movapd %xmm1,%xmm2 + mulpd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulpd %xmm1,%xmm5 # Cu + movapd %xmm1,%xmm3 + mulpd %xmm2,%xmm3 # u^3 + mulpd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulpd %xmm3,%xmm4 #Du^3 + + addpd .L__real_ca1(%rip),%xmm2 # +A + movapd %xmm3,%xmm1 + mulpd %xmm1,%xmm1 # u^6 + addpd %xmm4,%xmm5 #Cu+Du3 + + mulpd %xmm3,%xmm2 #u3(A+Bu2) + mulpd %xmm5,%xmm1 #u6(Cu+Du3) + addpd %xmm1,%xmm2 + subpd %xmm6,%xmm2 # -correction + +# loge to log2 + movapd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subpd %xmm3,%xmm0 + addpd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movapd %xmm3,%xmm0 + movapd %xmm2,%xmm1 + + mulpd .L__real_log2e_tail(%rip),%xmm2 + mulpd .L__real_log2e_tail(%rip),%xmm0 + mulpd .L__real_log2e_lead(%rip),%xmm1 + mulpd .L__real_log2e_lead(%rip),%xmm3 + addpd %xmm2,%xmm0 + addpd %xmm1,%xmm0 + addpd %xmm3,%xmm0 +# return r + r2; +# addpd %xmm2,%xmm0 + ret + + .align 16 +.L__near_one1: + cmp $3,%r9d + jnz .L__n1nb1 + + movapd p_x(%rsp),%xmm0 + call .Lboth_nearone + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + jmp .L__vlog2n + + .align 16 +.L__n1nb1: + test $1,%r9d + jz .L__lnn12 + + movlpd 
p_x(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,(%rdi) + +.L__lnn12: + test $2,%r9d # second number? + jz .L__lnn1e + movlpd p_x+8(%rsp),%xmm0 + call .L__ln1 + movlpd %xmm0,8(%rdi) + +.L__lnn1e: + jmp .L__vlog2n + + + .align 16 +.L__near_one2: + cmp $3,%r9d + jnz .L__n1nb2 + + movapd p_x2(%rsp),%xmm0 + call .Lboth_nearone + movapd %xmm0,%xmm7 + jmp .L__vlog4n + + .align 16 +.L__n1nb2: + test $1,%r9d + jz .L__lnn22 + + movlpd p_x2(%rsp),%xmm0 + call .L__ln1 + movsd %xmm0,%xmm7 + +.L__lnn22: + test $2,%r9d # second number? + jz .L__lnn2e + movlpd p_x2+8(%rsp),%xmm0 + call .L__ln1 + movlhps %xmm0,%xmm7 + +.L__lnn2e: + jmp .L__vlog4n + + .align 16 + +.L__ln1: +# saves 10 cycles +# r = x - 1.0; + movlpd .L__real_two(%rip),%xmm2 + subsd .L__real_one(%rip),%xmm0 # r +# u = r / (2.0 + r); + addsd %xmm0,%xmm2 + movsd %xmm0,%xmm1 + divsd %xmm2,%xmm1 # u + movlpd .L__real_ca4(%rip),%xmm4 #D + movlpd .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movsd %xmm0,%xmm6 + mulsd %xmm1,%xmm6 # correction +# u = u + u; + addsd %xmm1,%xmm1 #u + movsd %xmm1,%xmm2 + mulsd %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulsd %xmm1,%xmm5 # Cu + movsd %xmm1,%xmm3 + mulsd %xmm2,%xmm3 # u^3 + mulsd .L__real_ca2(%rip),%xmm2 #Bu^2 + mulsd %xmm3,%xmm4 #Du^3 + + addsd .L__real_ca1(%rip),%xmm2 # +A + movsd %xmm3,%xmm1 + mulsd %xmm1,%xmm1 # u^6 + addsd %xmm4,%xmm5 #Cu+Du3 + + mulsd %xmm3,%xmm2 #u3(A+Bu2) + mulsd %xmm5,%xmm1 #u6(Cu+Du3) + addsd %xmm1,%xmm2 + subsd %xmm6,%xmm2 # -correction + +# loge to log2 + movsd %xmm0,%xmm3 #r1 = r + pand .L__mask_lower(%rip),%xmm3 + subsd %xmm3,%xmm0 + addsd %xmm0,%xmm2 #r2 = r2 + (r - r1); + + movsd %xmm3,%xmm0 + movsd %xmm2,%xmm1 + + mulsd .L__real_log2e_tail(%rip),%xmm2 + mulsd .L__real_log2e_tail(%rip),%xmm0 + mulsd .L__real_log2e_lead(%rip),%xmm1 + mulsd .L__real_log2e_lead(%rip),%xmm3 + addsd %xmm2,%xmm0 + addsd %xmm1,%xmm0 + addsd %xmm3,%xmm0 + +# return r + r2; +# addsd %xmm2,%xmm0 + ret + + .align 16 + +# at least one of the numbers was a nan or infinity +.L__log_naninf: + test $1,%r8d # first number? + jz .L__lninf2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rdx + movlpd p_x(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninf2: + test $2,%r8d # second number? + jz .L__lninfe + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rdx + movlpd p_x+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe: + jmp .L__vlog1 # continue processing if not + +# at least one of the numbers was a nan or infinity +.L__log_naninf2: + test $1,%r10d # first number? + jz .L__lninf22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm7,%xmm1 # save the inputs + mov p_x2(%rsp),%rdx + movlpd p_x2(%rsp),%xmm0 + call .L__lni + shufpd $2,%xmm7,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + movapd %xmm0,%xmm7 + +.L__lninf22: + test $2,%r10d # second number? 
+ jz .L__lninfe2 + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rdx + movlpd p_x2+8(%rsp),%xmm0 + call .L__lni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__lninfe2: + jmp .L__vlog3 # continue processing if not + +# a subroutine to treat one number for nan/infinity +# the number is expected in rdx and returned in the low +# half of xmm0 +.L__lni: + mov $0x0000FFFFFFFFFFFFF,%rax + test %rax,%rdx + jnz .L__lnan # jump if mantissa not zero, so it's a NaN +# inf + rcl $1,%rdx + jnc .L__lne2 # log(+inf) = inf +# negative x + movlpd .L__real_nan(%rip),%xmm0 + ret + +#NaN +.L__lnan: + mov $0x00008000000000000,%rax # convert to quiet + or %rax,%rdx +.L__lne: + movd %rdx,%xmm0 +.L__lne2: + ret + + .align 16 + +# at least one of the numbers was a zero, a negative number, or both. +.L__z_or_n: + test $1,%r9d # first number? + jz .L__zn2 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x(%rsp),%rax + call .L__zni + shufpd $2,%xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn2: + test $2,%r9d # second number? + jz .L__zne + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + movapd %xmm0,%xmm1 # save the inputs + mov p_x+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm1 + movapd %xmm1,%xmm0 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne: + jmp .L__vlog2 + +.L__z_or_n2: + test $1,%r11d # first number? + jz .L__zn22 + + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2(%rsp),%rax + call .L__zni + shufpd $2,%xmm7,%xmm0 + movapd %xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zn22: + test $2,%r11d # second number? + jz .L__zne2 + mov %rax,p2_temp(%rsp) + mov %rdx,p2_temp+8(%rsp) + mov p_x2+8(%rsp),%rax + call .L__zni + shufpd $0,%xmm0,%xmm7 + mov p2_temp(%rsp),%rax + mov p2_temp+8(%rsp),%rdx + +.L__zne2: + jmp .L__vlog4 +# a subroutine to treat one number for zero or negative values +# the number is expected in rax and returned in the low +# half of xmm0 +.L__zni: + shl $1,%rax + jnz .L__zn_x # if just a carry, then must be negative + movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0 + ret +.L__zn_x: + movlpd .L__real_nan(%rip),%xmm0 + ret + + +# we jump here when we have an odd number of log calls to make at the +# end +# we assume that rdx is pointing at the next x array element, +# r8 at the next y array element. The number of values left is in +# save_nv +.L__vda_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__finish # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + +# fill in a m128d with zeroes and the extra values and then make a recursive call. 
+ xorpd %xmm0,%xmm0 + movlpd %xmm0,p_x+8(%rsp) + movapd %xmm0,p_x+16(%rsp) + + mov (%rsi),%rcx # we know there's at least one + mov %rcx,p_x(%rsp) + cmp $2,%rax + jl .L__vdacg + + mov 8(%rsi),%rcx # do the second value + mov %rcx,p_x+8(%rsp) + cmp $3,%rax + jl .L__vdacg + + mov 16(%rsi),%rcx # do the third value + mov %rcx,p_x+16(%rsp) + +.L__vdacg: + mov $4,%rdi # parameter for N + lea p_x(%rsp),%rsi # &x parameter + lea p2_temp(%rsp),%rdx # &y parameter + call vrda_log2@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + mov p2_temp(%rsp),%rcx + mov %rcx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vdacgf + + mov p2_temp+8(%rsp),%rcx + mov %rcx,8(%rdi) # do the second value + cmp $3,%rax + jl .L__vdacgf + + mov p2_temp+16(%rsp),%rcx + mov %rcx,16(%rdi) # do the third value + +.L__vdacgf: + jmp .L__finish + + .data + .align 64 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_two: .quad 0x04000000000000000 # 2.0 + .quad 0x04000000000000000 +.L__real_ninf: .quad 0x0fff0000000000000 # -inf + .quad 0x0fff0000000000000 +.L__real_inf: .quad 0x07ff0000000000000 # +inf + .quad 0x07ff0000000000000 +.L__real_nan: .quad 0x07ff8000000000000 # NaN + .quad 0x07ff8000000000000 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 + +.L__real_sign: .quad 0x08000000000000000 # sign bit + .quad 0x08000000000000000 +.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit + .quad 0x07ffFFFFFFFFFFFFF +.L__real_threshold: .quad 0x03F9EB85000000000 # .03 + .quad 0x03F9EB85000000000 +.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit + .quad 0x00008000000000000 +.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits + .quad 0x0000FFFFFFFFFFFFF +.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */ + .quad 0x03f80000000000000 +.L__mask_1023: .quad 0x000000000000003ff # + .quad 0x000000000000003ff +.L__mask_040: .quad 0x00000000000000040 # + .quad 0x00000000000000040 +.L__mask_001: .quad 0x00000000000000001 # + .quad 0x00000000000000001 + +.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02 + .quad 0x03fb55555555554e6 +.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02 + .quad 0x03f89999999bac6d4 +.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03 + .quad 0x03f62492307f1519f +.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04 + .quad 0x03f3c8034c85dfff0 + +.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02 + .quad 0x03fb5555555555557 +.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02 + .quad 0x03f89999999865ede +.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03 + .quad 0x03f6249423bd94741 +.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01 + .quad 0x03fe62e42e0000000 +.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08 + .quad 0x03e6efa39ef35793c + +.L__real_half: .quad 0x03fe0000000000000 # 1/2 + .quad 0x03fe0000000000000 +.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00 + .quad 0x03FF7154400000000 +.L__real_log2e_tail : .quad 0x03ECB295C17F0BBBE # log2e_tail 3.23791044778235969970E-06 + .quad 0x03ECB295C17F0BBBE +.L__mask_lower: .quad 0x0ffffffff00000000 + .quad 0x0ffffffff00000000 + +.L__np_ln_lead_table: + .quad 0x0000000000000000 # 
0.00000000000000000000e+00 + .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02 + .quad 0x3f9f829800000000 # 3.07716131210327148438e-02 + .quad 0x3fa7745800000000 # 4.58095073699951171875e-02 + .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02 + .quad 0x3fb341d700000000 # 7.52233862876892089844e-02 + .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02 + .quad 0x3fba926d00000000 # 1.03796780109405517578e-01 + .quad 0x3fbe270700000000 # 1.17783010005950927734e-01 + .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01 + .quad 0x3fc2955280000000 # 1.45181953907012939453e-01 + .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01 + .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01 + .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01 + .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01 + .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01 + .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01 + .quad 0x3fce270700000000 # 2.35566020011901855469e-01 + .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01 + .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01 + .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01 + .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01 + .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01 + .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01 + .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01 + .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01 + .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01 + .quad 0x3fd686c800000000 # 3.51976394653320312500e-01 + .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01 + .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01 + .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01 + .quad 0x3fd9479400000000 # 3.94993782043457031250e-01 + .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01 + .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01 + .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01 + .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01 + .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01 + .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01 + .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01 + .quad 0x3fde744240000000 # 4.75845873355865478516e-01 + .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01 + .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01 + .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01 + .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01 + .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01 + .quad 0x3fe109f380000000 # 5.32464742660522460938e-01 + .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01 + .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01 + .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01 + .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01 + .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01 + .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01 + .quad 0x3fe307d720000000 # 5.94707071781158447266e-01 + .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01 + .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01 + .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01 + .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01 + .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01 + .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01 + .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01 + .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01 + .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01 + .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01 + .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01 
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_tail_table: + .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00 + .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09 + .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08 + .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08 + .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08 + .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08 + .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08 + .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08 + .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08 + .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08 + .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08 + .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08 + .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08 + .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10 + .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08 + .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08 + .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08 + .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08 + .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08 + .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08 + .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08 + .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08 + .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08 + .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08 + .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09 + .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09 + .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08 + .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08 + .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08 + .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08 + .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09 + .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08 + .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08 + .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08 + .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08 + .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08 + .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09 + .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08 + .quad 0x03e33071282fb989b # 4.43021445893361960146e-09 + .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08 + .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08 + .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08 + .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08 + .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08 + .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09 + .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08 + .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08 + .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08 + .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08 + .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08 + .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08 + .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08 + .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08 + .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08 + .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08 + .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08 + .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08 + .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09 + .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08 + .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08 + .quad 
0x03e496f16abb9df22 # 1.18436010922446096216e-08 + .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08 + .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08 + .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08 + .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08 + .quad 0 # for alignment +
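+# For reference, a hedged C sketch of the near-one paths above
+# (.Lboth_nearone and .L__ln1). ca_1..ca_4 are the .L__real_ca* constants;
+# split_high() mirrors the .L__mask_lower masking; names are illustrative:
+#
+# static double split_high(double r) /* keep the upper 32 bits of r */
+# {
+#     union { double d; unsigned long long u; } t;
+#     t.d = r;
+#     t.u &= 0xffffffff00000000ULL;
+#     return t.d;
+# }
+#
+# double log2_near_one(double x)
+# {
+#     double r = x - 1.0;
+#     double u = r / (2.0 + r);
+#     double correction = r * u;
+#     u = u + u;
+#     double v = u * u;
+#     double r2 = u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4)))
+#               - correction;
+#     double r1 = split_high(r);
+#     r2 = r2 + (r - r1);
+#     /* scale from loge to log2 with lead/tail products */
+#     return r1 * log2e_lead + (r1 * log2e_tail
+#          + r2 * log2e_tail + r2 * log2e_lead);
+# }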
diff --git a/src/gas/vrdalogr.S b/src/gas/vrdalogr.S new file mode 100644 index 0000000..4064fb3 --- /dev/null +++ b/src/gas/vrdalogr.S
@@ -0,0 +1,2428 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalogr.asm
+#
+# An array implementation of the log libm function.
+#
+# Prototype:
+#
+# void vrda_logr(int n, double *x, double *y);
+#
+# Computes the natural log of x.
+# A reduced-precision routine. Uses the novel Intel reduction technique
+# with frcpa. Also uses only 3 polynomial terms to achieve 52-18 = 34 significant bits.
+#
+# This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+# This routine is not C99 compliant.
+# This version can compute logs in 26
+# cycles with n <= 24
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+
+.equ p_x2,0x030 # temporary for error checking operation
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+.equ save_rdi,0x088 #qword
+
+.equ save_rsi,0x090 #qword
+
+
+
+.equ p2_temp,0x0e0 # second temporary for get/put bits operation
+.equ p2_temp1,0x0f0 # second temporary for exponent multiply
+
+
+
+.equ stack_size,0x0118
+
+ .weak vrda_logr_
+ .set vrda_logr_,__vrda_logr__
+ .weak vrda_logr__
+ .set vrda_logr__,__vrda_logr__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log
+#** VRDA_LOGR(N,X,Y)
+#** C equivalent
+#*/
+#void vrda_logr_(int * n, double *x, double *y)
+#{
+# vrda_logr(*n,x,y);
+#}
+.globl __vrda_logr__
+ .type __vrda_logr__,@function
+__vrda_logr__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_logr
+ .type vrda_logr,@function
+vrda_logr:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
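+
+# As a hedged C sketch, one element of the loop below computes roughly the
+# following (illustrative only; frcpa_bits() is a hypothetical helper
+# standing in for the __vrd4_frcpa call plus the psllq/psrlq/pand bit
+# extraction of the exponent N and table index k):
+#
+# double log_reduced(double x)
+# {
+#     int N, k;
+#     double y = frcpa_bits(x, &N, &k);          /* y ~= 1/x            */
+#     double r = x * y - 1.0;                    /* reduced argument    */
+#     double p = r - 0.5*r*r + (1.0/3.0)*r*r*r;  /* 3-term ln(1+r)      */
+#     return N * 0.693147182465 + np_lnf_table[k] + p;  /* N*ln2 + T    */
+# }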
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm1
+ movhpd -8(%rsi),%xmm1
+# compute the logs
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+
+ call __vrd4_frcpa@PLT
+ movdqa %xmm8,%xmm4
+ movdqa %xmm9,%xmm7
+# invert the exponent
+ psllq $1,%xmm8
+ psllq $1,%xmm9
+ mulpd %xmm0,%xmm4 # r
+ mulpd %xmm1,%xmm7 # r
+ movdqa %xmm8,%xmm5
+ paddq .L__mask_rup(%rip),%xmm8
+ psrlq $53,%xmm8
+ movdqa %xmm9,%xmm6
+ paddq .L__mask_rup(%rip),%xmm6
+ psrlq $53,%xmm6
+ psubq .L__mask_3ff(%rip),%xmm8
+ psubq .L__mask_3ff(%rip),%xmm6
+ pshufd $0x058,%xmm8,%xmm8
+ pshufd $0x058,%xmm6,%xmm6
+
+
+ subpd .L__real_one(%rip),%xmm4
+ subpd .L__real_one(%rip),%xmm7
+
+ cvtdq2pd %xmm8,%xmm0 #N
+ cvtdq2pd %xmm6,%xmm1 #N
+
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
+ psrlq $42,%xmm5
+ psrlq $42,%xmm9
+ paddq .L__int_one(%rip),%xmm5
+ paddq .L__int_one(%rip),%xmm9
+ psrlq $1,%xmm5
+ psrlq $1,%xmm9
+ pand .L__mask_3ff(%rip),%xmm5
+ pand .L__mask_3ff(%rip),%xmm9
+ psllq $1,%xmm5
+ psllq $1,%xmm9
+
+ movdqa %xmm5,p_x(%rsp) # move the indexes to a memory location
+ movdqa %xmm9,p_x2(%rsp)
+
+
+ movapd .L__real_third(%rip),%xmm3
+ movdqa %xmm3,%xmm5
+ movapd %xmm4,%xmm2
+ movapd %xmm7,%xmm8
+
+# approximation
+# compute the polynomial
+# p(r) = p1r^2+p2r^3+p3r^4+p4r^5
+
+ mulpd %xmm4,%xmm2 #r^2
+ mulpd %xmm7,%xmm8 #r^2
+
+# eliminating the 4th and 5th terms gets us to 8000 ulps, or 53-16 = 37 significant bits.
+# The routine runs in 60 cycles.
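+# (Note: .L__real_half in this file's data section is stored negated,
+# 0x0bfe0000000000000 = -0.5, so the multiply/add chain below evaluates
+# p = r + r*r*((1.0/3.0)*r - 0.5) = r - r^2/2 + r^3/3, i.e. the ln(1+r)
+# series truncated after the cubic term.)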
+ mulpd %xmm4,%xmm3 # 1/3r + mulpd %xmm7,%xmm5 # 1/3r +# lookup the f(k) term + lea .L__np_lnf_table(%rip),%rdx + mov p_x(%rsp),%rcx + mov p_x+8(%rsp),%r9 + movlpd (%rdx,%rcx,8),%xmm6 # lookup + movhpd (%rdx,%r9,8),%xmm6 # lookup + + addpd .L__real_half(%rip),%xmm3 # p2 + p3r + addpd .L__real_half(%rip),%xmm5 # p2 + p3r + + mov p_x2(%rsp),%rcx + mov p_x2+8(%rsp),%r9 + movlpd (%rdx,%rcx,8),%xmm9 # lookup + movhpd (%rdx,%r9,8),%xmm9 # lookup + + mulpd %xmm3,%xmm2 # r2(p2 + p3r) + mulpd %xmm5,%xmm8 # r2(p2 + p3r) + addpd %xmm4,%xmm2 # +r + addpd %xmm7,%xmm8 # +r + + +# reconstruction +# compute ln(x) = T + r + p(r) where +# T = N*ln(2)+ln(1/frcpa(x)) via tab of ln(1/frcpa(y)), where y = 1 + k/256, 0<=k<=255 + + mulpd .L__real_log2(%rip),%xmm0 # compute N*__real_log2 + mulpd .L__real_log2(%rip),%xmm1 # compute N*__real_log2 + addpd %xmm6,%xmm2 # add the new mantissas + addpd %xmm9,%xmm8 # add the new mantissas + addpd %xmm2,%xmm0 + addpd %xmm8,%xmm1 +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + + + prefetch 64(%rdi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movlpd %xmm1,-16(%rdi) + movhpd %xmm1,-8(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vda_top + + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vda_cleanup + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + +# we jump here when we have an odd number of log calls to make at the +# end +# we assume that rdx is pointing at the next x array element, +# r8 at the next y array element. The number of values left is in +# save_nv +.L__vda_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__finish # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + +# fill in a m128d with zeroes and the extra values and then make a recursive call. 
+ xorpd %xmm0,%xmm0 + movlpd %xmm0,p_x+8(%rsp) + movapd %xmm0,p_x+16(%rsp) + + mov (%rsi),%rcx # we know there's at least one + mov %rcx,p_x(%rsp) + cmp $2,%rax + jl .L__vdacg + + mov 8(%rsi),%rcx # do the second value + mov %rcx,p_x+8(%rsp) + cmp $3,%rax + jl .L__vdacg + + mov 16(%rsi),%rcx # do the third value + mov %rcx,p_x+16(%rsp) + +.L__vdacg: + mov $4,%rcx # parameter for N + lea p_x(%rsp),%rdx # &x parameter + lea p2_temp(%rsp),%r8 # &y parameter + call vrda_logr@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + mov p2_temp(%rsp),%rcx + mov %rcx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vdacgf + + mov p2_temp+8(%rsp),%rcx + mov %rcx,8(%rdi) # do the second value + cmp $3,%rax + jl .L__vdacgf + + mov p2_temp+16(%rsp),%rcx + mov %rcx,16(%rdi) # do the third value + +.L__vdacgf: + jmp .L__finish + + .data + .align 64 + +.L__real_one: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 + +.L__real_half: .quad 0x0bfe0000000000000 # 1/2 + .quad 0x0bfe0000000000000 +.L__real_third: .quad 0x03fd5555555555555 # 1/3 + .quad 0x03fd5555555555555 +.L__real_fourth: .quad 0x0bfd0000000000000 # 1/4 + .quad 0x0bfd0000000000000 +.L__real_fifth: .quad 0x03fc999999999999a # 1/5 + .quad 0x03fc999999999999a +.L__real_sixth: .quad 0x0bfc5555555555555 # 1/6 + .quad 0x0bfc5555555555555 + +.L__real_log2: .quad 0x03FE62E42FEFA39EF # 0.693147182465 + .quad 0x03FE62E42FEFA39EF + +.L__mask_3ff: .quad 0x000000000000003ff # + .quad 0x000000000000003ff + +.L__mask_rup: .quad 0x0000003fffffffffe + .quad 0x0000003fffffffffe + +.L__int_one: .quad 0x00000000000000001 + .quad 0x00000000000000001 + + + +.L__mask_10bits: .quad 0x000000000000003ff + .quad 0x000000000000003ff + +.L__mask_expext: .quad 0x000000000003ff000 + .quad 0x000000000003ff000 + +.L__mask_expext2: .quad 0x000000000003ff800 + .quad 0x000000000003ff800 + + + + +.L__np_lnf_table: +#log table Program - logtab.c +#Built Jan 18 2006 09:51:57 +#Compiler version 1400 + + .quad 0x00000000000000000 # 0.000000000000 0 + .quad 0x00000000000000000 + .quad 0x03F50020055655885 # 0.000977039648 1 + .quad 0x03F50020055655885 + .quad 0x03F60040155D5881E # 0.001955034836 2 + .quad 0x03F60040155D5881E + .quad 0x03F6809048289860A # 0.002933987435 3 + .quad 0x03F6809048289860A + .quad 0x03F70080559588B25 # 0.003913899321 4 + .quad 0x03F70080559588B25 + .quad 0x03F740C8A7478788D # 0.004894772377 5 + .quad 0x03F740C8A7478788D + .quad 0x03F78121214586B02 # 0.005876608489 6 + .quad 0x03F78121214586B02 + .quad 0x03F7C189CBB0E283F # 0.006859409551 7 + .quad 0x03F7C189CBB0E283F + .quad 0x03F8010157588DE69 # 0.007843177461 8 + .quad 0x03F8010157588DE69 + .quad 0x03F82145E939EF1BC # 0.008827914124 9 + .quad 0x03F82145E939EF1BC + .quad 0x03F83D8896A83D7A8 # 0.009690354884 10 + .quad 0x03F83D8896A83D7A8 + .quad 0x03F85DDC705054DFF # 0.010676913110 11 + .quad 0x03F85DDC705054DFF + .quad 0x03F87E38762CA0C6D # 0.011664445593 12 + .quad 0x03F87E38762CA0C6D + .quad 0x03F89E9CAC6007563 # 0.012652954261 13 + .quad 0x03F89E9CAC6007563 + .quad 0x03F8BF091710935A4 # 0.013642441046 14 + .quad 0x03F8BF091710935A4 + .quad 0x03F8DF7DBA6777895 # 0.014632907884 15 + .quad 0x03F8DF7DBA6777895 + .quad 0x03F8FBEA8B13C03F9 # 0.015500371846 16 + .quad 0x03F8FBEA8B13C03F9 + .quad 0x03F90E3751F24F45C # 0.016492681528 17 + .quad 0x03F90E3751F24F45C + .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18 + .quad 0x03F91E7D80B1FBF4C + .quad 0x03F92CBE4F6CC56C3 # 
0.018355920375 19 + .quad 0x03F92CBE4F6CC56C3 + .quad 0x03F93D0C443D7258C # 0.019351069108 20 + .quad 0x03F93D0C443D7258C + .quad 0x03F94D5E6176ACC89 # 0.020347209148 21 + .quad 0x03F94D5E6176ACC89 + .quad 0x03F95DB4A937DEF10 # 0.021344342472 22 + .quad 0x03F95DB4A937DEF10 + .quad 0x03F96C039490E37F4 # 0.022217650494 23 + .quad 0x03F96C039490E37F4 + .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24 + .quad 0x03F97C61B1CF5DED7 + .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25 + .quad 0x03F98AB77B3FD6EAD + .quad 0x03F99B1D75828E780 # 0.025092472797 26 + .quad 0x03F99B1D75828E780 + .quad 0x03F9AB87A478CB7CB # 0.026094351403 27 + .quad 0x03F9AB87A478CB7CB + .quad 0x03F9B9E8027E1916F # 0.026971819338 28 + .quad 0x03F9B9E8027E1916F + .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29 + .quad 0x03F9CA5A1A18613E6 + .quad 0x03F9D8C1670325921 # 0.028854704473 30 + .quad 0x03F9D8C1670325921 + .quad 0x03F9E93B6EE41F674 # 0.029860361378 31 + .quad 0x03F9E93B6EE41F674 + .quad 0x03F9F7A9B16782855 # 0.030741141554 32 + .quad 0x03F9F7A9B16782855 + .quad 0x03FA0415D89E74440 # 0.031748698315 33 + .quad 0x03FA0415D89E74440 + .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34 + .quad 0x03FA0C58FA19DFAAB + .quad 0x03FA139577CC41C1A # 0.033640607815 35 + .quad 0x03FA139577CC41C1A + .quad 0x03FA1AD398C6CD57C # 0.034524725334 36 + .quad 0x03FA1AD398C6CD57C + .quad 0x03FA231C9C40E204E # 0.035536103423 37 + .quad 0x03FA231C9C40E204E + .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38 + .quad 0x03FA2A5E4231CF7BD + .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39 + .quad 0x03FA32AB4D4C59CB0 + .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40 + .quad 0x03FA39F07BA0EBD5A + .quad 0x03FA424192495D571 # 0.039337907520 41 + .quad 0x03FA424192495D571 + .quad 0x03FA498A4C73DA65D # 0.040227078744 42 + .quad 0x03FA498A4C73DA65D + .quad 0x03FA50D4AF75CA86F # 0.041117041297 43 + .quad 0x03FA50D4AF75CA86F + .quad 0x03FA592BBC15215BC # 0.042135112141 44 + .quad 0x03FA592BBC15215BC + .quad 0x03FA6079B00423FF6 # 0.043026775152 45 + .quad 0x03FA6079B00423FF6 + .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46 + .quad 0x03FA67C94F2D4BB65 + .quad 0x03FA70265A550E77B # 0.044940163069 47 + .quad 0x03FA70265A550E77B + .quad 0x03FA77798F8D6DFDC # 0.045834331871 48 + .quad 0x03FA77798F8D6DFDC + .quad 0x03FA7ECE7267CD123 # 0.046729300926 49 + .quad 0x03FA7ECE7267CD123 + .quad 0x03FA873184BC09586 # 0.047753104446 50 + .quad 0x03FA873184BC09586 + .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51 + .quad 0x03FA8E8A02D2E3175 + .quad 0x03FA95E430F8CE456 # 0.049547286652 52 + .quad 0x03FA95E430F8CE456 + .quad 0x03FA9D400FF482586 # 0.050445586359 53 + .quad 0x03FA9D400FF482586 + .quad 0x03FAA5AB21CB34A9E # 0.051473203662 54 + .quad 0x03FAA5AB21CB34A9E + .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55 + .quad 0x03FAAD0AA2E784EF4 + .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56 + .quad 0x03FAB46BD74DA76A0 + .quad 0x03FABBCEBFC68F424 # 0.054175734102 57 + .quad 0x03FABBCEBFC68F424 + .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58 + .quad 0x03FAC3335D1BBAE4D + .quad 0x03FACBA87200EB8F1 # 0.056110594428 59 + .quad 0x03FACBA87200EB8F1 + .quad 0x03FAD310BA20455A2 # 0.057014812019 60 + .quad 0x03FAD310BA20455A2 + .quad 0x03FADA7AB998B77ED # 0.057919847959 61 + .quad 0x03FADA7AB998B77ED + .quad 0x03FAE1E6713606CFB # 0.058825703731 62 + .quad 0x03FAE1E6713606CFB + .quad 0x03FAE953E1C48603A # 0.059732380822 63 + .quad 0x03FAE953E1C48603A + .quad 0x03FAF0C30C1116351 # 0.060639880722 64 + .quad 0x03FAF0C30C1116351 + .quad 0x03FAF833F0E927711 # 0.061548204926 65 + .quad 0x03FAF833F0E927711 + .quad 
0x03FAFFA6911AB9309 # 0.062457354934 66 + .quad 0x03FAFFA6911AB9309 + .quad 0x03FB038D76BA2D737 # 0.063367332247 67 + .quad 0x03FB038D76BA2D737 + .quad 0x03FB0748836296412 # 0.064278138373 68 + .quad 0x03FB0748836296412 + .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69 + .quad 0x03FB0B046EEE6F7A4 + .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70 + .quad 0x03FB0EC139C5DA5FD + .quad 0x03FB127EE451413A8 # 0.067015544762 71 + .quad 0x03FB127EE451413A8 + .quad 0x03FB163D6EF9579FC # 0.067929681294 72 + .quad 0x03FB163D6EF9579FC + .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73 + .quad 0x03FB19FCDA271ABC0 + .quad 0x03FB1DBD2643D1912 # 0.069760465119 74 + .quad 0x03FB1DBD2643D1912 + .quad 0x03FB217E53B90D3CE # 0.070677115481 75 + .quad 0x03FB217E53B90D3CE + .quad 0x03FB254062F0A9417 # 0.071594606862 76 + .quad 0x03FB254062F0A9417 + .quad 0x03FB29035454CBCB0 # 0.072512940806 77 + .quad 0x03FB29035454CBCB0 + .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78 + .quad 0x03FB2CC7284FE5F1A + .quad 0x03FB308BDF4CB4062 # 0.074352142586 79 + .quad 0x03FB308BDF4CB4062 + .quad 0x03FB345179B63DD3F # 0.075273013532 80 + .quad 0x03FB345179B63DD3F + .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81 + .quad 0x03FB3817F7F7D6EAB + .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82 + .quad 0x03FB3BDF5A7D1EE5E + .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83 + .quad 0x03FB3F1D405CE86D3 + .quad 0x03FB42E64BEC266E4 # 0.078832909176 84 + .quad 0x03FB42E64BEC266E4 + .quad 0x03FB46B03CF437BC4 # 0.079757917501 85 + .quad 0x03FB46B03CF437BC4 + .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86 + .quad 0x03FB4A7B13E1E3E65 + .quad 0x03FB4E46D1223FE84 # 0.081610505036 87 + .quad 0x03FB4E46D1223FE84 + .quad 0x03FB52137522AE732 # 0.082538087426 88 + .quad 0x03FB52137522AE732 + .quad 0x03FB5555DE434F2A0 # 0.083333843436 89 + .quad 0x03FB5555DE434F2A0 + .quad 0x03FB59242FF043D34 # 0.084263026485 90 + .quad 0x03FB59242FF043D34 + .quad 0x03FB5CF36997817B2 # 0.085193073719 91 + .quad 0x03FB5CF36997817B2 + .quad 0x03FB60C38BA799459 # 0.086123986746 92 + .quad 0x03FB60C38BA799459 + .quad 0x03FB6408F471C82A2 # 0.086922602521 93 + .quad 0x03FB6408F471C82A2 + .quad 0x03FB67DAC7466CB96 # 0.087855127734 94 + .quad 0x03FB67DAC7466CB96 + .quad 0x03FB6BAD83C1883BA # 0.088788523361 95 + .quad 0x03FB6BAD83C1883BA + .quad 0x03FB6EF528C056A2D # 0.089589270768 96 + .quad 0x03FB6EF528C056A2D + .quad 0x03FB72C9985035BB1 # 0.090524287199 97 + .quad 0x03FB72C9985035BB1 + .quad 0x03FB769EF2C6B5688 # 0.091460178704 98 + .quad 0x03FB769EF2C6B5688 + .quad 0x03FB79E8D70A364C6 # 0.092263069152 99 + .quad 0x03FB79E8D70A364C6 + .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100 + .quad 0x03FB7DBFE6EA733FE + .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101 + .quad 0x03FB8197E2F40E3F0 + .quad 0x03FB84E40992A4804 # 0.094944035906 102 + .quad 0x03FB84E40992A4804 + .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103 + .quad 0x03FB88BDBD5FC66D2 + .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104 + .quad 0x03FB8C985E9B9EC7E + .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105 + .quad 0x03FB8FE6CAB20E979 + .quad 0x03FB93C3261014C65 # 0.098574780162 106 + .quad 0x03FB93C3261014C65 + .quad 0x03FB97130DC9235DE # 0.099383405543 107 + .quad 0x03FB97130DC9235DE + .quad 0x03FB9AF124D64C623 # 0.100327628989 108 + .quad 0x03FB9AF124D64C623 + .quad 0x03FB9E4289871E964 # 0.101137673586 109 + .quad 0x03FB9E4289871E964 + .quad 0x03FBA2225DD276FCB # 0.102083555691 110 + .quad 0x03FBA2225DD276FCB + .quad 0x03FBA57540D1FE441 # 0.102895024494 111 + .quad 0x03FBA57540D1FE441 + .quad 0x03FBA956D3ECADE60 # 0.103842571097 112 + 
.quad 0x03FBA956D3ECADE60 + .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113 + .quad 0x03FBACAB3693AB9C0 + .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114 + .quad 0x03FBB08E8A10F96F4 + .quad 0x03FBB3E46DBA02181 # 0.106419018383 115 + .quad 0x03FBB3E46DBA02181 + .quad 0x03FBB7C9832F58018 # 0.107369911615 116 + .quad 0x03FBB7C9832F58018 + .quad 0x03FBBB20E936D6976 # 0.108185683244 117 + .quad 0x03FBBB20E936D6976 + .quad 0x03FBBF07C23BC54EA # 0.109138258671 118 + .quad 0x03FBBF07C23BC54EA + .quad 0x03FBC260ABFFFE972 # 0.109955474734 119 + .quad 0x03FBC260ABFFFE972 + .quad 0x03FBC6494A2E418A0 # 0.110909738320 120 + .quad 0x03FBC6494A2E418A0 + .quad 0x03FBC9A3B90F57748 # 0.111728403941 121 + .quad 0x03FBC9A3B90F57748 + .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122 + .quad 0x03FBCCFEDBFEE13A8 + .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123 + .quad 0x03FBD0EA1362CDBFC + .quad 0x03FBD446BD753D433 # 0.114325275488 124 + .quad 0x03FBD446BD753D433 + .quad 0x03FBD7A41C8627307 # 0.115146743223 125 + .quad 0x03FBD7A41C8627307 + .quad 0x03FBDB91F09680DF9 # 0.116105975911 126 + .quad 0x03FBDB91F09680DF9 + .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127 + .quad 0x03FBDEF0D8D466DBB + .quad 0x03FBE2507702AF03B # 0.117752518544 128 + .quad 0x03FBE2507702AF03B + .quad 0x03FBE640EB3D2B411 # 0.118714255240 129 + .quad 0x03FBE640EB3D2B411 + .quad 0x03FBE9A214A69DD58 # 0.119539337795 130 + .quad 0x03FBE9A214A69DD58 + .quad 0x03FBED03F4F440969 # 0.120365101673 131 + .quad 0x03FBED03F4F440969 + .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132 + .quad 0x03FBF0F70CDD992E4 + .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133 + .quad 0x03FBF45A7A78B7C3B + .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134 + .quad 0x03FBF7BE9FEDBFDED + .quad 0x03FBFB237D8AB13FB # 0.123813143156 135 + .quad 0x03FBFB237D8AB13FB + .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136 + .quad 0x03FBFF1A13EAC95FD + .quad 0x03FC014040CAB0229 # 0.125610834299 137 + .quad 0x03FC014040CAB0229 + .quad 0x03FC02F3D4301417B # 0.126441629140 138 + .quad 0x03FC02F3D4301417B + .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139 + .quad 0x03FC04A7C44CF87A4 + .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140 + .quad 0x03FC06A4D1D26C5E9 + .quad 0x03FC08598B59E3A07 # 0.129077042275 141 + .quad 0x03FC08598B59E3A07 + .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142 + .quad 0x03FC0A0EA2164AF02 + .quad 0x03FC0BC4162F73B66 # 0.130745099376 143 + .quad 0x03FC0BC4162F73B66 + .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144 + .quad 0x03FC0D79E7CD48E58 + .quad 0x03FC0F301717CF0FB # 0.132415943541 145 + .quad 0x03FC0F301717CF0FB + .quad 0x03FC10E6A437247B7 # 0.133252413686 146 + .quad 0x03FC10E6A437247B7 + .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147 + .quad 0x03FC12E6BFA8FEAD6 + .quad 0x03FC149E189F8642E # 0.135067169541 148 + .quad 0x03FC149E189F8642E + .quad 0x03FC1655CFEA923A4 # 0.135905861231 149 + .quad 0x03FC1655CFEA923A4 + .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150 + .quad 0x03FC180DE5B2ACE5C + .quad 0x03FC19C65A207AC07 # 0.137585357777 151 + .quad 0x03FC19C65A207AC07 + .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152 + .quad 0x03FC1B7F2D5CBA842 + .quad 0x03FC1D385F90453F2 # 0.139267679777 153 + .quad 0x03FC1D385F90453F2 + .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154 + .quad 0x03FC1EF1F0E40E6CD + .quad 0x03FC20ABE18124098 # 0.140952836755 155 + .quad 0x03FC20ABE18124098 + .quad 0x03FC22663190AEACC # 0.141796481350 156 + .quad 0x03FC22663190AEACC + .quad 0x03FC2420E13BF19E3 # 0.142640838281 157 + .quad 0x03FC2420E13BF19E3 + .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158 + .quad 
0x03FC25DBF0AC4AED2 + .quad 0x03FC2797600B3387B # 0.144331693975 159 + .quad 0x03FC2797600B3387B + .quad 0x03FC29532F823F525 # 0.145178195155 160 + .quad 0x03FC29532F823F525 + .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161 + .quad 0x03FC2B0F5F3B1D3EF + .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162 + .quad 0x03FC2CCBEF5F97653 + .quad 0x03FC2E88E01993187 # 0.147722006588 163 + .quad 0x03FC2E88E01993187 + .quad 0x03FC3046319311009 # 0.148571383763 164 + .quad 0x03FC3046319311009 + .quad 0x03FC3203E3F62D328 # 0.149421482992 165 + .quad 0x03FC3203E3F62D328 + .quad 0x03FC33C1F76D1F469 # 0.150272305505 166 + .quad 0x03FC33C1F76D1F469 + .quad 0x03FC35806C223A70F # 0.151123852534 167 + .quad 0x03FC35806C223A70F + .quad 0x03FC373F423FED9A1 # 0.151976125313 168 + .quad 0x03FC373F423FED9A1 + .quad 0x03FC38FE79F0C3771 # 0.152829125080 169 + .quad 0x03FC38FE79F0C3771 + .quad 0x03FC3ABE135F62A12 # 0.153682853077 170 + .quad 0x03FC3ABE135F62A12 + .quad 0x03FC3C335E0447D71 # 0.154394850259 171 + .quad 0x03FC3C335E0447D71 + .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172 + .quad 0x03FC3DF3AB13505F9 + .quad 0x03FC3FB45A59928CA # 0.156105714663 173 + .quad 0x03FC3FB45A59928CA + .quad 0x03FC41756C0220C81 # 0.156962245765 174 + .quad 0x03FC41756C0220C81 + .quad 0x03FC4336E03829D61 # 0.157819511141 175 + .quad 0x03FC4336E03829D61 + .quad 0x03FC44F8B726F8EFE # 0.158677512051 176 + .quad 0x03FC44F8B726F8EFE + .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177 + .quad 0x03FC46BAF0F9F5DB8 + .quad 0x03FC48326CD3EC797 # 0.160252428262 178 + .quad 0x03FC48326CD3EC797 + .quad 0x03FC49F55C6502F81 # 0.161112520058 179 + .quad 0x03FC49F55C6502F81 + .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180 + .quad 0x03FC4BB8AF55DE908 + .quad 0x03FC4D7C65D25566D # 0.162834926111 181 + .quad 0x03FC4D7C65D25566D + .quad 0x03FC4F4080065AA7F # 0.163697242922 182 + .quad 0x03FC4F4080065AA7F + .quad 0x03FC50B98CD30A759 # 0.164416408720 183 + .quad 0x03FC50B98CD30A759 + .quad 0x03FC527E5E4A1B58D # 0.165280090939 184 + .quad 0x03FC527E5E4A1B58D + .quad 0x03FC544393F5DF80F # 0.166144519750 185 + .quad 0x03FC544393F5DF80F + .quad 0x03FC56092E02BA514 # 0.167009696444 186 + .quad 0x03FC56092E02BA514 + .quad 0x03FC57837B3098F2C # 0.167731249257 187 + .quad 0x03FC57837B3098F2C + .quad 0x03FC5949CDB873419 # 0.168597800437 188 + .quad 0x03FC5949CDB873419 + .quad 0x03FC5B10851FC924A # 0.169465103180 189 + .quad 0x03FC5B10851FC924A + .quad 0x03FC5C8BC079D8289 # 0.170188430518 190 + .quad 0x03FC5C8BC079D8289 + .quad 0x03FC5E533144C1718 # 0.171057114516 191 + .quad 0x03FC5E533144C1718 + .quad 0x03FC601B076E7A8A8 # 0.171926553783 192 + .quad 0x03FC601B076E7A8A8 + .quad 0x03FC619732215D786 # 0.172651664394 193 + .quad 0x03FC619732215D786 + .quad 0x03FC635FC298F6C77 # 0.173522491735 194 + .quad 0x03FC635FC298F6C77 + .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195 + .quad 0x03FC6528B8EFA5D16 + .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196 + .quad 0x03FC66A5D42A3AD33 + .quad 0x03FC686F85BAD4298 # 0.175993962063 197 + .quad 0x03FC686F85BAD4298 + .quad 0x03FC6A399DABBD383 # 0.176867706111 198 + .quad 0x03FC6A399DABBD383 + .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199 + .quad 0x03FC6BB7AA9F22C40 + .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200 + .quad 0x03FC6D827EB7C1E57 + .quad 0x03FC6F0128B756AB9 # 0.179201429458 201 + .quad 0x03FC6F0128B756AB9 + .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202 + .quad 0x03FC70CCB9927BCF6 + .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203 + .quad 0x03FC7298B1A4E32B6 + .quad 0x03FC74184F58CC7DC # 0.181686992547 204 + .quad 
0x03FC74184F58CC7DC + .quad 0x03FC75E5051E74141 # 0.182565727226 205 + .quad 0x03FC75E5051E74141 + .quad 0x03FC77654128F6127 # 0.183298596442 206 + .quad 0x03FC77654128F6127 + .quad 0x03FC7932B53E97639 # 0.184178749058 207 + .quad 0x03FC7932B53E97639 + .quad 0x03FC7AB390229D8FD # 0.184912801796 208 + .quad 0x03FC7AB390229D8FD + .quad 0x03FC7C81C325B4A5E # 0.185794376934 209 + .quad 0x03FC7C81C325B4A5E + .quad 0x03FC7E033D66CD24A # 0.186529617023 210 + .quad 0x03FC7E033D66CD24A + .quad 0x03FC7FD22FF599D4C # 0.187412619288 211 + .quad 0x03FC7FD22FF599D4C + .quad 0x03FC81544A17F67C1 # 0.188149050576 212 + .quad 0x03FC81544A17F67C1 + .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213 + .quad 0x03FC8323FCD17DAC8 + .quad 0x03FC84A6B759F512D # 0.189771110947 214 + .quad 0x03FC84A6B759F512D + .quad 0x03FC86772ADE0201C # 0.190656981373 215 + .quad 0x03FC86772ADE0201C + .quad 0x03FC87FA865210911 # 0.191395806674 216 + .quad 0x03FC87FA865210911 + .quad 0x03FC89CBBB4136201 # 0.192283118179 217 + .quad 0x03FC89CBBB4136201 + .quad 0x03FC8B4FB826FF291 # 0.193023146334 218 + .quad 0x03FC8B4FB826FF291 + .quad 0x03FC8D21AF2299298 # 0.193911903613 219 + .quad 0x03FC8D21AF2299298 + .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220 + .quad 0x03FC8EA64E00E7FC0 + .quad 0x03FC902B36AB7681D # 0.195394923313 221 + .quad 0x03FC902B36AB7681D + .quad 0x03FC91FE49096581E # 0.196285791969 222 + .quad 0x03FC91FE49096581E + .quad 0x03FC9383D471B869B # 0.197028789254 223 + .quad 0x03FC9383D471B869B + .quad 0x03FC9557AA6B87F65 # 0.197921115309 224 + .quad 0x03FC9557AA6B87F65 + .quad 0x03FC96DDD91A0B959 # 0.198665329082 225 + .quad 0x03FC96DDD91A0B959 + .quad 0x03FC9864522D04491 # 0.199410097121 226 + .quad 0x03FC9864522D04491 + .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227 + .quad 0x03FC9A3945D1A44B3 + .quad 0x03FC9BC062F26FC3B # 0.201050541900 228 + .quad 0x03FC9BC062F26FC3B + .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229 + .quad 0x03FC9D47CAD2C1871 + .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230 + .quad 0x03FC9F1DDD7FE4F8B + .quad 0x03FCA0A5EA371A910 # 0.203441457564 231 + .quad 0x03FCA0A5EA371A910 + .quad 0x03FCA22E42098F498 # 0.204189792554 232 + .quad 0x03FCA22E42098F498 + .quad 0x03FCA405751F6CCE4 # 0.205088534376 233 + .quad 0x03FCA405751F6CCE4 + .quad 0x03FCA58E729348F40 # 0.205838103409 234 + .quad 0x03FCA58E729348F40 + .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235 + .quad 0x03FCA717BB7EC64A3 + .quad 0x03FCA8F010601E5FD # 0.207489135679 236 + .quad 0x03FCA8F010601E5FD + .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237 + .quad 0x03FCAA79FFB8FCD48 + .quad 0x03FCAC043AE68965A # 0.208992443238 238 + .quad 0x03FCAC043AE68965A + .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239 + .quad 0x03FCAD8EC205FB6AD + .quad 0x03FCAF6895610DBAD # 0.210648695969 240 + .quad 0x03FCAF6895610DBAD + .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241 + .quad 0x03FCB0F3C3FBD65C9 + .quad 0x03FCB27F3EE674219 # 0.212156764419 242 + .quad 0x03FCB27F3EE674219 + .quad 0x03FCB40B063E65B0F # 0.212911652354 243 + .quad 0x03FCB40B063E65B0F + .quad 0x03FCB5E65A8096C88 # 0.213818270730 244 + .quad 0x03FCB5E65A8096C88 + .quad 0x03FCB772CA646760C # 0.214574414434 245 + .quad 0x03FCB772CA646760C + .quad 0x03FCB8FF871461198 # 0.215331130323 246 + .quad 0x03FCB8FF871461198 + .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247 + .quad 0x03FCBA8C90AE4AD19 + .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248 + .quad 0x03FCBC19E74FFCBDA + .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249 + .quad 0x03FCBDF71B83DAE7A + .quad 0x03FCBF851C067555C # 0.218515604922 250 + .quad 
0x03FCBF851C067555C + .quad 0x03FCC11369F0CDB3C # 0.219275310193 251 + .quad 0x03FCC11369F0CDB3C + .quad 0x03FCC2A205610593E # 0.220035593055 252 + .quad 0x03FCC2A205610593E + .quad 0x03FCC430EE755023B # 0.220796454387 253 + .quad 0x03FCC430EE755023B + .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254 + .quad 0x03FCC5C0254BF23A8 + .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255 + .quad 0x03FCC79F9AB632BF1 + .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256 + .quad 0x03FCC92F7D09ABE20 + .quad 0x03FCCABFAD80D023D # 0.223998408788 257 + .quad 0x03FCCABFAD80D023D + .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258 + .quad 0x03FCCC502C3A2F1E8 + .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259 + .quad 0x03FCCDE0F9546A5E7 + .quad 0x03FCCF7214EE356E9 # 0.226291812439 260 + .quad 0x03FCCF7214EE356E9 + .quad 0x03FCD1037F2655E7B # 0.227057450635 261 + .quad 0x03FCD1037F2655E7B + .quad 0x03FCD295381BA37E9 # 0.227823675483 262 + .quad 0x03FCD295381BA37E9 + .quad 0x03FCD4273FED08111 # 0.228590487882 263 + .quad 0x03FCD4273FED08111 + .quad 0x03FCD5B996B97FB5F # 0.229357888733 264 + .quad 0x03FCD5B996B97FB5F + .quad 0x03FCD74C3CA018C9C # 0.230125878940 265 + .quad 0x03FCD74C3CA018C9C + .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266 + .quad 0x03FCD8DF31BFF3FF2 + .quad 0x03FCDA727638446A1 # 0.231663631050 267 + .quad 0x03FCDA727638446A1 + .quad 0x03FCDC56CAE452F5B # 0.232587418645 268 + .quad 0x03FCDC56CAE452F5B + .quad 0x03FCDDEABE5A3926E # 0.233357894066 269 + .quad 0x03FCDDEABE5A3926E + .quad 0x03FCDF7F018CE771F # 0.234128963578 270 + .quad 0x03FCDF7F018CE771F + .quad 0x03FCE113949BDEC62 # 0.234900628096 271 + .quad 0x03FCE113949BDEC62 + .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272 + .quad 0x03FCE2A877A6B2C0F + .quad 0x03FCE43DAACD09BEC # 0.236445745833 273 + .quad 0x03FCE43DAACD09BEC + .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274 + .quad 0x03FCE5D32E2E9CE87 + .quad 0x03FCE76901EB38427 # 0.237993254653 275 + .quad 0x03FCE76901EB38427 + .quad 0x03FCE8ADE53F76866 # 0.238612929343 276 + .quad 0x03FCE8ADE53F76866 + .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277 + .quad 0x03FCEA4449F04AAF4 + .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278 + .quad 0x03FCEBDAFF5593E99 + .quad 0x03FCED72058F666C5 # 0.240940135421 279 + .quad 0x03FCED72058F666C5 + .quad 0x03FCEF095CBDE9937 # 0.241717075868 280 + .quad 0x03FCEF095CBDE9937 + .quad 0x03FCF0A1050157ED6 # 0.242494620422 281 + .quad 0x03FCF0A1050157ED6 + .quad 0x03FCF238FE79FF4BF # 0.243272770021 282 + .quad 0x03FCF238FE79FF4BF + .quad 0x03FCF3D1494840D2F # 0.244051525609 283 + .quad 0x03FCF3D1494840D2F + .quad 0x03FCF569E58C91077 # 0.244830888130 284 + .quad 0x03FCF569E58C91077 + .quad 0x03FCF702D36777DF0 # 0.245610858531 285 + .quad 0x03FCF702D36777DF0 + .quad 0x03FCF89C12F990D0C # 0.246391437760 286 + .quad 0x03FCF89C12F990D0C + .quad 0x03FCFA35A4638AE2C # 0.247172626770 287 + .quad 0x03FCFA35A4638AE2C + .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288 + .quad 0x03FCFB7D86EEE3B92 + .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289 + .quad 0x03FCFD17ABFCDB683 + .quad 0x03FCFEB2233EA07CB # 0.249363208150 290 + .quad 0x03FCFEB2233EA07CB + .quad 0x03FD0026766A9671C # 0.250146723037 291 + .quad 0x03FD0026766A9671C + .quad 0x03FD00F40470C7323 # 0.250930852302 292 + .quad 0x03FD00F40470C7323 + .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293 + .quad 0x03FD01C1BBC2735A3 + .quad 0x03FD028F9C7035C1D # 0.252500957822 294 + .quad 0x03FD028F9C7035C1D + .quad 0x03FD03346E0106062 # 0.253129690945 295 + .quad 0x03FD03346E0106062 + .quad 0x03FD0402994B4F041 # 0.253916163656 296 + .quad 
0x03FD0402994B4F041 + .quad 0x03FD04D0EE20620AF # 0.254703255393 297 + .quad 0x03FD04D0EE20620AF + .quad 0x03FD059F6C910034D # 0.255490967131 298 + .quad 0x03FD059F6C910034D + .quad 0x03FD066E14ADF4BFD # 0.256279299848 299 + .quad 0x03FD066E14ADF4BFD + .quad 0x03FD07138604D5864 # 0.256910413785 300 + .quad 0x03FD07138604D5864 + .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301 + .quad 0x03FD07E2794F3E8C1 + .quad 0x03FD08B196753A125 # 0.258489943414 302 + .quad 0x03FD08B196753A125 + .quad 0x03FD0980DD87BA2DD # 0.259280644807 303 + .quad 0x03FD0980DD87BA2DD + .quad 0x03FD0A504E97BB40C # 0.260071971904 304 + .quad 0x03FD0A504E97BB40C + .quad 0x03FD0AF660EB9E278 # 0.260705484754 305 + .quad 0x03FD0AF660EB9E278 + .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306 + .quad 0x03FD0BC61DBBA97CB + .quad 0x03FD0C9604B8FC51E # 0.262291024962 307 + .quad 0x03FD0C9604B8FC51E + .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308 + .quad 0x03FD0D3C7586CD5E5 + .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309 + .quad 0x03FD0E0CA89A72D29 + .quad 0x03FD0EDD060B78082 # 0.264515013170 310 + .quad 0x03FD0EDD060B78082 + .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311 + .quad 0x03FD0FAD8DEB1E2C0 + .quad 0x03FD10547F9D26ABC # 0.265947336165 312 + .quad 0x03FD10547F9D26ABC + .quad 0x03FD1125540925114 # 0.266743958529 313 + .quad 0x03FD1125540925114 + .quad 0x03FD11F653144CB8B # 0.267541216005 314 + .quad 0x03FD11F653144CB8B + .quad 0x03FD129DA43F5BE9E # 0.268179479949 315 + .quad 0x03FD129DA43F5BE9E + .quad 0x03FD136EF02E8290C # 0.268977883185 316 + .quad 0x03FD136EF02E8290C + .quad 0x03FD144066EDAE406 # 0.269776924378 317 + .quad 0x03FD144066EDAE406 + .quad 0x03FD14E817FF359D7 # 0.270416617347 318 + .quad 0x03FD14E817FF359D7 + .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319 + .quad 0x03FD15B9DBFA9DEC8 + .quad 0x03FD168BCAF73B3EB # 0.272017642345 320 + .quad 0x03FD168BCAF73B3EB + .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321 + .quad 0x03FD1733DC5D68DE8 + .quad 0x03FD180618EF18ADE # 0.273460759729 322 + .quad 0x03FD180618EF18ADE + .quad 0x03FD18D880B3826FE # 0.274263392407 323 + .quad 0x03FD18D880B3826FE + .quad 0x03FD1980F2DD42B6F # 0.274905962710 324 + .quad 0x03FD1980F2DD42B6F + .quad 0x03FD1A53A8902E70B # 0.275709756661 325 + .quad 0x03FD1A53A8902E70B + .quad 0x03FD1AFC59297024D # 0.276353257326 326 + .quad 0x03FD1AFC59297024D + .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327 + .quad 0x03FD1BCF5D04AE1EA + .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328 + .quad 0x03FD1CA28C64BAE54 + .quad 0x03FD1D4B9E796C245 # 0.278608776246 329 + .quad 0x03FD1D4B9E796C245 + .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330 + .quad 0x03FD1E1F1C5C3A06C + .quad 0x03FD1EC86D5747AAD # 0.280061443760 331 + .quad 0x03FD1EC86D5747AAD + .quad 0x03FD1F9C39F74C559 # 0.280869394034 332 + .quad 0x03FD1F9C39F74C559 + .quad 0x03FD2070326F1F789 # 0.281677997620 333 + .quad 0x03FD2070326F1F789 + .quad 0x03FD2119E59F8789C # 0.282325351583 334 + .quad 0x03FD2119E59F8789C + .quad 0x03FD21EE2D300381C # 0.283135133796 335 + .quad 0x03FD21EE2D300381C + .quad 0x03FD22981FBEF797A # 0.283783432036 336 + .quad 0x03FD22981FBEF797A + .quad 0x03FD236CB6A339EED # 0.284594396317 337 + .quad 0x03FD236CB6A339EED + .quad 0x03FD2416E8C01F606 # 0.285243641592 338 + .quad 0x03FD2416E8C01F606 + .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339 + .quad 0x03FD24EBCF3387FF6 + .quad 0x03FD2596410DF963A # 0.286705986479 340 + .quad 0x03FD2596410DF963A + .quad 0x03FD266B774C2AF55 # 0.287519325279 341 + .quad 0x03FD266B774C2AF55 + .quad 0x03FD27162913F873F # 0.288170472950 342 + .quad 
0x03FD27162913F873F + .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343 + .quad 0x03FD27EBAF58D8C9C + .quad 0x03FD2896A13E086A3 # 0.289637107288 344 + .quad 0x03FD2896A13E086A3 + .quad 0x03FD296C77C5C0E13 # 0.290452834554 345 + .quad 0x03FD296C77C5C0E13 + .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346 + .quad 0x03FD2A17A9F88EDD2 + .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347 + .quad 0x03FD2AEDD0FF8CC2C + .quad 0x03FD2B9943B06BD77 # 0.292576844829 348 + .quad 0x03FD2B9943B06BD77 + .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349 + .quad 0x03FD2C6FBB7360D0E + .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350 + .quad 0x03FD2D1B6ED2FA90C + .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351 + .quad 0x03FD2DC73F01B0DD4 + .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352 + .quad 0x03FD2E9E2BCE12286 + .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353 + .quad 0x03FD2F4A3CF22EDC2 + .quad 0x03FD30217B1006601 # 0.297002718785 354 + .quad 0x03FD30217B1006601 + .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355 + .quad 0x03FD30CDCD5ABA762 + .quad 0x03FD31A55D07A8590 # 0.298482373803 356 + .quad 0x03FD31A55D07A8590 + .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357 + .quad 0x03FD3251F0AA5CC1A + .quad 0x03FD32FEA167A6D70 # 0.299799463226 358 + .quad 0x03FD32FEA167A6D70 + .quad 0x03FD33D6A7509D491 # 0.300623525901 359 + .quad 0x03FD33D6A7509D491 + .quad 0x03FD348399ADA9D94 # 0.301283265328 360 + .quad 0x03FD348399ADA9D94 + .quad 0x03FD3530A9454ADC9 # 0.301943440298 361 + .quad 0x03FD3530A9454ADC9 + .quad 0x03FD360925EC44F5C # 0.302769272371 362 + .quad 0x03FD360925EC44F5C + .quad 0x03FD36B6776BE1116 # 0.303430429420 363 + .quad 0x03FD36B6776BE1116 + .quad 0x03FD378F469437FB4 # 0.304257490918 364 + .quad 0x03FD378F469437FB4 + .quad 0x03FD383CDA2E14ECB # 0.304919632971 365 + .quad 0x03FD383CDA2E14ECB + .quad 0x03FD38EA8B3924521 # 0.305582213748 366 + .quad 0x03FD38EA8B3924521 + .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367 + .quad 0x03FD39C3D1FD60E74 + .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368 + .quad 0x03FD3A71C56BB48C7 + .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369 + .quad 0x03FD3B1FD66BC8D10 + .quad 0x03FD3BF995502CB5C # 0.308569272059 370 + .quad 0x03FD3BF995502CB5C + .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371 + .quad 0x03FD3CA7E8FD01DF6 + .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372 + .quad 0x03FD3D565A5C5BF11 + .quad 0x03FD3E3091E6049FB # 0.310732154526 373 + .quad 0x03FD3E3091E6049FB + .quad 0x03FD3EDF463C1683E # 0.311398599069 374 + .quad 0x03FD3EDF463C1683E + .quad 0x03FD3F8E1865A82DD # 0.312065488057 375 + .quad 0x03FD3F8E1865A82DD + .quad 0x03FD403D086CEA79B # 0.312732822082 376 + .quad 0x03FD403D086CEA79B + .quad 0x03FD4117DE854CA15 # 0.313567616354 377 + .quad 0x03FD4117DE854CA15 + .quad 0x03FD41C711E4BA15E # 0.314235953889 378 + .quad 0x03FD41C711E4BA15E + .quad 0x03FD427663431B221 # 0.314904738398 379 + .quad 0x03FD427663431B221 + .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380 + .quad 0x03FD4325D2AAB6F18 + .quad 0x03FD44014838E5513 # 0.316411140893 381 + .quad 0x03FD44014838E5513 + .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382 + .quad 0x03FD44B0FB5AF4F44 + .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383 + .quad 0x03FD4560CCA7CB3B2 + .quad 0x03FD4610BC29C5E18 # 0.318423214006 384 + .quad 0x03FD4610BC29C5E18 + .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385 + .quad 0x03FD46ECD216CDCB5 + .quad 0x03FD479D05B65CB60 # 0.319934930091 386 + .quad 0x03FD479D05B65CB60 + .quad 0x03FD484D57ACE5A1A # 0.320607538154 387 + .quad 0x03FD484D57ACE5A1A + .quad 0x03FD48FDC804DD1CB # 0.321280598924 388 + .quad 
0x03FD48FDC804DD1CB + .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389 + .quad 0x03FD49DA7F3BCC420 + .quad 0x03FD4A8B341552B09 # 0.322796644021 390 + .quad 0x03FD4A8B341552B09 + .quad 0x03FD4B3C077267E9A # 0.323471180303 391 + .quad 0x03FD4B3C077267E9A + .quad 0x03FD4BECF95D97914 # 0.324146171892 392 + .quad 0x03FD4BECF95D97914 + .quad 0x03FD4C9E09E172C3D # 0.324821619401 393 + .quad 0x03FD4C9E09E172C3D + .quad 0x03FD4D4F3908901A0 # 0.325497523449 394 + .quad 0x03FD4D4F3908901A0 + .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395 + .quad 0x03FD4E2CDF1F341C1 + .quad 0x03FD4EDE535C79642 # 0.327019979972 396 + .quad 0x03FD4EDE535C79642 + .quad 0x03FD4F8FE65F90500 # 0.327697372039 397 + .quad 0x03FD4F8FE65F90500 + .quad 0x03FD5041983326F2D # 0.328375223276 398 + .quad 0x03FD5041983326F2D + .quad 0x03FD50F368E1F0F02 # 0.329053534308 399 + .quad 0x03FD50F368E1F0F02 + .quad 0x03FD51A55876A77F5 # 0.329732305758 400 + .quad 0x03FD51A55876A77F5 + .quad 0x03FD5283EF743F98B # 0.330581418486 401 + .quad 0x03FD5283EF743F98B + .quad 0x03FD533624B59CA35 # 0.331261228165 402 + .quad 0x03FD533624B59CA35 + .quad 0x03FD53E878FFE6EAE # 0.331941500300 403 + .quad 0x03FD53E878FFE6EAE + .quad 0x03FD549AEC5DEF880 # 0.332622235521 404 + .quad 0x03FD549AEC5DEF880 + .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405 + .quad 0x03FD554D7EDA8D3C4 + .quad 0x03FD560030809C759 # 0.333985097742 406 + .quad 0x03FD560030809C759 + .quad 0x03FD56B3015AFF52C # 0.334667226008 407 + .quad 0x03FD56B3015AFF52C + .quad 0x03FD5765F1749DA6C # 0.335349819892 408 + .quad 0x03FD5765F1749DA6C + .quad 0x03FD581900D864FD7 # 0.336032880027 409 + .quad 0x03FD581900D864FD7 + .quad 0x03FD58CC2F91489F5 # 0.336716407053 410 + .quad 0x03FD58CC2F91489F5 + .quad 0x03FD59AC5618CCE38 # 0.337571473373 411 + .quad 0x03FD59AC5618CCE38 + .quad 0x03FD5A5FCB795780C # 0.338256053239 412 + .quad 0x03FD5A5FCB795780C + .quad 0x03FD5B136052BCE39 # 0.338941102075 413 + .quad 0x03FD5B136052BCE39 + .quad 0x03FD5BC714B008E23 # 0.339626620526 414 + .quad 0x03FD5BC714B008E23 + .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415 + .quad 0x03FD5C7AE89C4D254 + .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416 + .quad 0x03FD5D2EDC22A12BA + .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417 + .quad 0x03FD5DE2EF4E224D6 + .quad 0x03FD5E972229F3C15 # 0.342373403369 418 + .quad 0x03FD5E972229F3C15 + .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419 + .quad 0x03FD5F4B74C13EA04 + .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420 + .quad 0x03FD5FFFE71F31E9A + .quad 0x03FD60B4794F02875 # 0.344438453147 421 + .quad 0x03FD60B4794F02875 + .quad 0x03FD61692B5BEB520 # 0.345127751813 422 + .quad 0x03FD61692B5BEB520 + .quad 0x03FD621DFD512D14F # 0.345817525940 423 + .quad 0x03FD621DFD512D14F + .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424 + .quad 0x03FD62D2EF3A0E933 + .quad 0x03FD63880121DC8AB # 0.347198503200 425 + .quad 0x03FD63880121DC8AB + .quad 0x03FD643D3313E9B92 # 0.347889707652 426 + .quad 0x03FD643D3313E9B92 + .quad 0x03FD64F2851B8EE01 # 0.348581390197 427 + .quad 0x03FD64F2851B8EE01 + .quad 0x03FD65A7F7442AC90 # 0.349273551498 428 + .quad 0x03FD65A7F7442AC90 + .quad 0x03FD665D8999224A5 # 0.349966192218 429 + .quad 0x03FD665D8999224A5 + .quad 0x03FD67133C25E04A5 # 0.350659313022 430 + .quad 0x03FD67133C25E04A5 + .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431 + .quad 0x03FD67C90EF5D5C4C + .quad 0x03FD687F021479CEE # 0.352046997547 432 + .quad 0x03FD687F021479CEE + .quad 0x03FD6935158D499B3 # 0.352741562603 433 + .quad 0x03FD6935158D499B3 + .quad 0x03FD69EB496BC87E5 # 0.353436610416 434 + .quad 
0x03FD69EB496BC87E5 + .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435 + .quad 0x03FD6AA19DBB7FF34 + .quad 0x03FD6B581287FF9FD # 0.354828156996 436 + .quad 0x03FD6B581287FF9FD + .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437 + .quad 0x03FD6C0EA7DCDD591 + .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438 + .quad 0x03FD6C97AD3CFCFD9 + .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439 + .quad 0x03FD6D4E7B9C727EC + .quad 0x03FD6E056AA4421D6 # 0.357442537571 440 + .quad 0x03FD6E056AA4421D6 + .quad 0x03FD6EBC7A6019066 # 0.358140861621 441 + .quad 0x03FD6EBC7A6019066 + .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442 + .quad 0x03FD6F73AADBAAAB7 + .quad 0x03FD702AFC22B0C6D # 0.359538974397 443 + .quad 0x03FD702AFC22B0C6D + .quad 0x03FD70E26E40EB5FA # 0.360238764489 444 + .quad 0x03FD70E26E40EB5FA + .quad 0x03FD719A014220CF5 # 0.360939044629 445 + .quad 0x03FD719A014220CF5 + .quad 0x03FD7251B5321DC54 # 0.361639815506 446 + .quad 0x03FD7251B5321DC54 + .quad 0x03FD73098A1CB54BA # 0.362341077807 447 + .quad 0x03FD73098A1CB54BA + .quad 0x03FD73937F783CEBA # 0.362867347444 448 + .quad 0x03FD73937F783CEBA + .quad 0x03FD744B8E35E9EDA # 0.363569471398 449 + .quad 0x03FD744B8E35E9EDA + .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450 + .quad 0x03FD7503BE0ED6C66 + .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451 + .quad 0x03FD75BC0F0EEE7DE + .quad 0x03FD76748142228C7 # 0.365678805982 452 + .quad 0x03FD76748142228C7 + .quad 0x03FD772D14B46AE00 # 0.366382907402 453 + .quad 0x03FD772D14B46AE00 + .quad 0x03FD77E5C971C5E06 # 0.367087504930 454 + .quad 0x03FD77E5C971C5E06 + .quad 0x03FD787066E04915F # 0.367616279067 455 + .quad 0x03FD787066E04915F + .quad 0x03FD792955FDF47A3 # 0.368321746469 456 + .quad 0x03FD792955FDF47A3 + .quad 0x03FD79E26687CFB3D # 0.369027711906 457 + .quad 0x03FD79E26687CFB3D + .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458 + .quad 0x03FD7A9B9889F19E2 + .quad 0x03FD7B54EC1077A48 # 0.370441139703 459 + .quad 0x03FD7B54EC1077A48 + .quad 0x03FD7C0E612785C74 # 0.371148603475 460 + .quad 0x03FD7C0E612785C74 + .quad 0x03FD7C998F06FB152 # 0.371679529954 461 + .quad 0x03FD7C998F06FB152 + .quad 0x03FD7D533EF841E8A # 0.372387870696 462 + .quad 0x03FD7D533EF841E8A + .quad 0x03FD7E0D109B95F19 # 0.373096713539 463 + .quad 0x03FD7E0D109B95F19 + .quad 0x03FD7EC703FD340AA # 0.373806059198 464 + .quad 0x03FD7EC703FD340AA + .quad 0x03FD7F8119295FB9B # 0.374515908385 465 + .quad 0x03FD7F8119295FB9B + .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466 + .quad 0x03FD800CBF3ED1CC2 + .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467 + .quad 0x03FD80C70FAB0BDF6 + .quad 0x03FD81818203AFC7F # 0.376470595813 468 + .quad 0x03FD81818203AFC7F + .quad 0x03FD823C16551A3C3 # 0.377182339615 469 + .quad 0x03FD823C16551A3C3 + .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470 + .quad 0x03FD82C81BE4DFF4A + .quad 0x03FD8382EBC7794D1 # 0.378429111528 471 + .quad 0x03FD8382EBC7794D1 + .quad 0x03FD843DDDC4FB137 # 0.379142251156 472 + .quad 0x03FD843DDDC4FB137 + .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473 + .quad 0x03FD84F8F1E9DB72B + .quad 0x03FD85855776DCBFB # 0.380391470556 474 + .quad 0x03FD85855776DCBFB + .quad 0x03FD8640A77EB3957 # 0.381106011494 475 + .quad 0x03FD8640A77EB3957 + .quad 0x03FD86FC19D05148E # 0.381821063366 476 + .quad 0x03FD86FC19D05148E + .quad 0x03FD87B7AE7845C0F # 0.382536626902 477 + .quad 0x03FD87B7AE7845C0F + .quad 0x03FD8844748678822 # 0.383073635776 478 + .quad 0x03FD8844748678822 + .quad 0x03FD89004563D3DFD # 0.383790096491 479 + .quad 0x03FD89004563D3DFD + .quad 0x03FD89BC38BA356B4 # 0.384507070890 480 + .quad 
0x03FD89BC38BA356B4 + .quad 0x03FD8A4945E20894E # 0.385045139237 481 + .quad 0x03FD8A4945E20894E + .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482 + .quad 0x03FD8B0575AAB1FC5 + .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483 + .quad 0x03FD8BC1C80F45A32 + .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484 + .quad 0x03FD8C7E3D1C80B2F + .quad 0x03FD8D0BABACC89EE # 0.387739832326 485 + .quad 0x03FD8D0BABACC89EE + .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486 + .quad 0x03FD8DC85D7FE5013 + .quad 0x03FD8E85321ED5598 # 0.389179976589 487 + .quad 0x03FD8E85321ED5598 + .quad 0x03FD8F12E873862C7 # 0.389720565845 488 + .quad 0x03FD8F12E873862C7 + .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489 + .quad 0x03FD8FCFFA1614AA0 + .quad 0x03FD908D2EA7D9511 # 0.391163567538 490 + .quad 0x03FD908D2EA7D9511 + .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491 + .quad 0x03FD911B2D09ED9D6 + .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492 + .quad 0x03FD91D89EDD6B7FF + .quad 0x03FD929633C3B7D3E # 0.393151100941 493 + .quad 0x03FD929633C3B7D3E + .quad 0x03FD93247A7C99B52 # 0.393693841796 494 + .quad 0x03FD93247A7C99B52 + .quad 0x03FD93E24CE3195E8 # 0.394417954789 495 + .quad 0x03FD93E24CE3195E8 + .quad 0x03FD9470C1CB1962E # 0.394961383840 496 + .quad 0x03FD9470C1CB1962E + .quad 0x03FD952ED1D9C0435 # 0.395686415592 497 + .quad 0x03FD952ED1D9C0435 + .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498 + .quad 0x03FD95ED0535EA5D9 + .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499 + .quad 0x03FD967BC2EDCCE17 + .quad 0x03FD973A3431356AE # 0.397682967666 500 + .quad 0x03FD973A3431356AE + .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501 + .quad 0x03FD97F8C8E64A1C7 + .quad 0x03FD9887CFB8A3932 # 0.398955579419 502 + .quad 0x03FD9887CFB8A3932 + .quad 0x03FD9946A2946EF3C # 0.399683513937 503 + .quad 0x03FD9946A2946EF3C + .quad 0x03FD99D5D8130607C # 0.400229812776 504 + .quad 0x03FD99D5D8130607C + .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505 + .quad 0x03FD9A94E93E1EC37 + .quad 0x03FD9B244D87735E8 # 0.401505671875 506 + .quad 0x03FD9B244D87735E8 + .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507 + .quad 0x03FD9BE39D2A97F0B + .quad 0x03FD9CA3109266E23 # 0.402965792595 508 + .quad 0x03FD9CA3109266E23 + .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509 + .quad 0x03FD9D32BEA15ED3A + .quad 0x03FD9DF270C1914A8 # 0.404245149435 510 + .quad 0x03FD9DF270C1914A8 + .quad 0x03FD9E824DEA3E135 # 0.404793946669 511 + .quad 0x03FD9E824DEA3E135 + .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512 + .quad 0x03FD9F423EEBF9DA1 + .quad 0x03FD9FD24B4D47012 # 0.406075646011 513 + .quad 0x03FD9FD24B4D47012 + .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514 + .quad 0x03FDA0927B59DA6E2 + .quad 0x03FDA152CF7F3B46D # 0.407542459622 515 + .quad 0x03FDA152CF7F3B46D + .quad 0x03FDA1E32653B420E # 0.408093069896 516 + .quad 0x03FDA1E32653B420E + .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517 + .quad 0x03FDA2A3B9C527DB1 + .quad 0x03FDA33440224FA79 # 0.409379007429 518 + .quad 0x03FDA33440224FA79 + .quad 0x03FDA3F513098DD09 # 0.410114572008 519 + .quad 0x03FDA3F513098DD09 + .quad 0x03FDA485C90EBDB0C # 0.410666600728 520 + .quad 0x03FDA485C90EBDB0C + .quad 0x03FDA546DB95A721A # 0.411403113374 521 + .quad 0x03FDA546DB95A721A + .quad 0x03FDA5D7C16257437 # 0.411955854060 522 + .quad 0x03FDA5D7C16257437 + .quad 0x03FDA69913B2F6572 # 0.412693317221 523 + .quad 0x03FDA69913B2F6572 + .quad 0x03FDA72A2966BE1EA # 0.413246771713 524 + .quad 0x03FDA72A2966BE1EA + .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525 + .quad 0x03FDA7EBBBAB46E8B + .quad 0x03FDA87D0165DD199 # 0.414539357989 526 + .quad 
0x03FDA87D0165DD199 + .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527 + .quad 0x03FDA93ED3C8AD9E3 + .quad 0x03FDA9D049A9E884A # 0.415833617206 528 + .quad 0x03FDA9D049A9E884A + .quad 0x03FDAA925C5588EFA # 0.416573946686 529 + .quad 0x03FDAA925C5588EFA + .quad 0x03FDAB24027D5E8AF # 0.417129553701 530 + .quad 0x03FDAB24027D5E8AF + .quad 0x03FDABE6559C8167C # 0.417870843580 531 + .quad 0x03FDABE6559C8167C + .quad 0x03FDAC782C2B07944 # 0.418427171828 532 + .quad 0x03FDAC782C2B07944 + .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533 + .quad 0x03FDAD3ABFE88A06E + .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534 + .quad 0x03FDADCCC6FDF6A80 + .quad 0x03FDAE5EE2E961227 # 0.420283837790 535 + .quad 0x03FDAE5EE2E961227 + .quad 0x03FDAF21D34189D0A # 0.421027470470 536 + .quad 0x03FDAF21D34189D0A + .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537 + .quad 0x03FDAFB41FE2167B4 + .quad 0x03FDB07751416A7F3 # 0.422330159776 538 + .quad 0x03FDB07751416A7F3 + .quad 0x03FDB109CEB79DB8A # 0.422888975102 539 + .quad 0x03FDB109CEB79DB8A + .quad 0x03FDB1CD41498DF12 # 0.423634548296 540 + .quad 0x03FDB1CD41498DF12 + .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541 + .quad 0x03FDB25FEFB60CB2E + .quad 0x03FDB323A3A63594A # 0.424940640468 542 + .quad 0x03FDB323A3A63594A + .quad 0x03FDB3B68329C59E9 # 0.425500916886 543 + .quad 0x03FDB3B68329C59E9 + .quad 0x03FDB44977C148F1A # 0.426061507389 544 + .quad 0x03FDB44977C148F1A + .quad 0x03FDB50D895F7773A # 0.426809450580 545 + .quad 0x03FDB50D895F7773A + .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546 + .quad 0x03FDB5A0AF3D169CD + .quad 0x03FDB66502A41E541 # 0.428119698779 547 + .quad 0x03FDB66502A41E541 + .quad 0x03FDB6F859E8EF639 # 0.428681759684 548 + .quad 0x03FDB6F859E8EF639 + .quad 0x03FDB78BC664238C0 # 0.429244136679 549 + .quad 0x03FDB78BC664238C0 + .quad 0x03FDB85078123E586 # 0.429994464983 550 + .quad 0x03FDB85078123E586 + .quad 0x03FDB8E41624226C5 # 0.430557580905 551 + .quad 0x03FDB8E41624226C5 + .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552 + .quad 0x03FDB9A90A06BCB3D + .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553 + .quad 0x03FDBA3CD9D0B81BD + .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554 + .quad 0x03FDBAD0BEF3DB164 + .quad 0x03FDBB9611B80E2FC # 0.433189656123 555 + .quad 0x03FDBB9611B80E2FC + .quad 0x03FDBC2A28C33B75D # 0.433754574696 556 + .quad 0x03FDBC2A28C33B75D + .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557 + .quad 0x03FDBCBE553C2BDDF + .quad 0x03FDBD84073D8EC2B # 0.435073960430 558 + .quad 0x03FDBD84073D8EC2B + .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559 + .quad 0x03FDBE1865CEC1EC9 + .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560 + .quad 0x03FDBEACD9E271AD1 + .quad 0x03FDBF72EB7D20355 # 0.436961822044 561 + .quad 0x03FDBF72EB7D20355 + .quad 0x03FDC00791D99132B # 0.437528876213 562 + .quad 0x03FDC00791D99132B + .quad 0x03FDC09C4DCD565AB # 0.438096252115 563 + .quad 0x03FDC09C4DCD565AB + .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564 + .quad 0x03FDC162BF5DF23E4 + .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565 + .quad 0x03FDC1F7ADCB3DAB0 + .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566 + .quad 0x03FDC28CB1E4D32FD + .quad 0x03FDC35383C8850B0 # 0.440748271097 567 + .quad 0x03FDC35383C8850B0 + .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568 + .quad 0x03FDC3E8BA8CACF27 + .quad 0x03FDC47E071233744 # 0.441887007223 569 + .quad 0x03FDC47E071233744 + .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570 + .quad 0x03FDC54539A6ABCD2 + .quad 0x03FDC5DAB908186FF # 0.443217173690 571 + .quad 0x03FDC5DAB908186FF + .quad 0x03FDC6704E4016FF7 # 0.443787787115 572 + .quad 
0x03FDC6704E4016FF7 + .quad 0x03FDC737E1E38F4FB # 0.444549111857 573 + .quad 0x03FDC737E1E38F4FB + .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574 + .quad 0x03FDC7CDAA290FEAD + .quad 0x03FDC863885A74D16 # 0.445692186852 575 + .quad 0x03FDC863885A74D16 + .quad 0x03FDC8F97C7E299DB # 0.446264214707 576 + .quad 0x03FDC8F97C7E299DB + .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577 + .quad 0x03FDC9C18EDC7C26B + .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578 + .quad 0x03FDCA57B64E9DB05 + .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579 + .quad 0x03FDCAEDF3C88A364 + .quad 0x03FDCB844750B9995 # 0.448746790220 580 + .quad 0x03FDCB844750B9995 + .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581 + .quad 0x03FDCC4CD90B3ECE5 + .quad 0x03FDCCE3602341C10 # 0.450086118843 582 + .quad 0x03FDCCE3602341C10 + .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583 + .quad 0x03FDCD79FD5F2BC77 + .quad 0x03FDCE10B0C581284 # 0.451235544257 584 + .quad 0x03FDCE10B0C581284 + .quad 0x03FDCED9C27EC6607 # 0.452002562511 585 + .quad 0x03FDCED9C27EC6607 + .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586 + .quad 0x03FDCF70A9B6D3810 + .quad 0x03FDD007A72F19BBC # 0.453154194116 587 + .quad 0x03FDD007A72F19BBC + .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588 + .quad 0x03FDD09EBAEE29DD8 + .quad 0x03FDD1684D49F46AE # 0.454499442710 589 + .quad 0x03FDD1684D49F46AE + .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590 + .quad 0x03FDD1FF951D1F1B3 + .quad 0x03FDD296F34D0B65C # 0.455653955057 591 + .quad 0x03FDD296F34D0B65C + .quad 0x03FDD32E67E056BD5 # 0.456231711452 592 + .quad 0x03FDD32E67E056BD5 + .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593 + .quad 0x03FDD3C5F2DDA1840 + .quad 0x03FDD490246DEFA6A # 0.457581109247 594 + .quad 0x03FDD490246DEFA6A + .quad 0x03FDD527E3D1B95FC # 0.458159980465 595 + .quad 0x03FDD527E3D1B95FC + .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596 + .quad 0x03FDD5BFB9B5AE71F + .quad 0x03FDD657A6207C0DB # 0.459318729146 597 + .quad 0x03FDD657A6207C0DB + .quad 0x03FDD6EFA918D25CE # 0.459898607388 598 + .quad 0x03FDD6EFA918D25CE + .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599 + .quad 0x03FDD7BA7AD9E7DA1 + .quad 0x03FDD852B28BE5A0F # 0.461252965726 600 + .quad 0x03FDD852B28BE5A0F + .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601 + .quad 0x03FDD8EB00E1CCE14 + .quad 0x03FDD98365E25ABB9 # 0.462415306035 602 + .quad 0x03FDD98365E25ABB9 + .quad 0x03FDDA1BE1944F538 # 0.462996983220 603 + .quad 0x03FDDA1BE1944F538 + .quad 0x03FDDAE75484C9615 # 0.463773079495 604 + .quad 0x03FDDAE75484C9615 + .quad 0x03FDDB8005445488B # 0.464355547233 605 + .quad 0x03FDDB8005445488B + .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606 + .quad 0x03FDDC18CCCBDCB83 + .quad 0x03FDDCB1AB222F33D # 0.465521501504 607 + .quad 0x03FDDCB1AB222F33D + .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608 + .quad 0x03FDDD4AA04E1C4B7 + .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609 + .quad 0x03FDDDE3AC56775D2 + .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610 + .quad 0x03FDDE7CCF4216D6E + .quad 0x03FDDF492177D7BBC # 0.468052409114 611 + .quad 0x03FDDF492177D7BBC + .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612 + .quad 0x03FDDFE279E5BF4EE + .quad 0x03FDE07BE94DCC439 # 0.469222684263 613 + .quad 0x03FDE07BE94DCC439 + .quad 0x03FDE1156FB6E2626 # 0.469808335817 614 + .quad 0x03FDE1156FB6E2626 + .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615 + .quad 0x03FDE1AF0D27E88D7 + .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616 + .quad 0x03FDE248C1A7C8C26 + .quad 0x03FDE2E28D3D701CC # 0.471567351222 617 + .quad 0x03FDE2E28D3D701CC + .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618 + .quad 
0x03FDE37C6FEFCED73 + .quad 0x03FDE449C232C39D8 # 0.472937616681 619 + .quad 0x03FDE449C232C39D8 + .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620 + .quad 0x03FDE4E3DAEDDB5F6 + .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621 + .quad 0x03FDE57E0ADCE1EA5 + .quad 0x03FDE6185206D516F # 0.474702150027 622 + .quad 0x03FDE6185206D516F + .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623 + .quad 0x03FDE6B2B072B5E6F + .quad 0x03FDE74D26278887A # 0.475880237735 624 + .quad 0x03FDE74D26278887A + .quad 0x03FDE7E7B32C5453F # 0.476469802457 625 + .quad 0x03FDE7E7B32C5453F + .quad 0x03FDE882578823D52 # 0.477059714970 626 + .quad 0x03FDE882578823D52 + .quad 0x03FDE91D134204C67 # 0.477649975686 627 + .quad 0x03FDE91D134204C67 + .quad 0x03FDE9B7E6610815A # 0.478240585015 628 + .quad 0x03FDE9B7E6610815A + .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629 + .quad 0x03FDEA52D0EC41E5E + .quad 0x03FDEB218376ECFC0 # 0.479620031484 630 + .quad 0x03FDEB218376ECFC0 + .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631 + .quad 0x03FDEBBCA4C4E9E87 + .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632 + .quad 0x03FDEC57DD96CD0CB + .quad 0x03FDECF32DF3B887D # 0.481396406174 633 + .quad 0x03FDECF32DF3B887D + .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634 + .quad 0x03FDED8E95E2D1B88 + .quad 0x03FDEE2A156B413E5 # 0.482582411453 635 + .quad 0x03FDEE2A156B413E5 + .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636 + .quad 0x03FDEEC5AC9432FCB + .quad 0x03FDEF615B64D61C7 # 0.483769825010 637 + .quad 0x03FDEF615B64D61C7 + .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638 + .quad 0x03FDEFFD21E45D0D1 + .quad 0x03FDF0990019FD887 # 0.484958650194 639 + .quad 0x03FDF0990019FD887 + .quad 0x03FDF134F60CF092D # 0.485553593197 640 + .quad 0x03FDF134F60CF092D + .quad 0x03FDF1D103C4727E4 # 0.486148890367 641 + .quad 0x03FDF1D103C4727E4 + .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642 + .quad 0x03FDF26D2947C2EC5 + .quad 0x03FDF309669E24CF9 # 0.487340548899 643 + .quad 0x03FDF309669E24CF9 + .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644 + .quad 0x03FDF3A5BBCEDE6E1 + .quad 0x03FDF44228E13963A # 0.488533629176 645 + .quad 0x03FDF44228E13963A + .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646 + .quad 0x03FDF4DEADDC82A35 + .quad 0x03FDF57B4AC80A79A # 0.489728134594 647 + .quad 0x03FDF57B4AC80A79A + .quad 0x03FDF617FFAB248ED # 0.490325922795 648 + .quad 0x03FDF617FFAB248ED + .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649 + .quad 0x03FDF6B4CC8D27E87 + .quad 0x03FDF751B1756EEC8 # 0.491522572320 650 + .quad 0x03FDF751B1756EEC8 + .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651 + .quad 0x03FDF7EEAE6B5761C + .quad 0x03FDF88BC3764273B # 0.492720655530 652 + .quad 0x03FDF88BC3764273B + .quad 0x03FDF928F09D94B32 # 0.493320235842 653 + .quad 0x03FDF928F09D94B32 + .quad 0x03FDF9C635E8B6192 # 0.493920175866 654 + .quad 0x03FDF9C635E8B6192 + .quad 0x03FDFA63935F1208C # 0.494520476034 655 + .quad 0x03FDFA63935F1208C + .quad 0x03FDFB0109081751A # 0.495121136779 656 + .quad 0x03FDFB0109081751A + .quad 0x03FDFB9E96EB38311 # 0.495722158534 657 + .quad 0x03FDFB9E96EB38311 + .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658 + .quad 0x03FDFC3C3D0FEA555 + .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659 + .quad 0x03FDFCD9FB7DA6DEF + .quad 0x03FDFD77D23BEA634 # 0.497527394206 660 + .quad 0x03FDFD77D23BEA634 + .quad 0x03FDFE15C15234EE2 # 0.498129864352 661 + .quad 0x03FDFE15C15234EE2 + .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662 + .quad 0x03FDFEB3C8C80A04E + .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663 + .quad 0x03FDFF51E8A4F0A74 + .quad 0x03FDFFF020F07352E # 0.499939455677 664 + .quad 
0x03FDFFF020F07352E + .quad 0x03FE004738D910023 # 0.500543381211 665 + .quad 0x03FE004738D910023 + .quad 0x03FE00966D78C41CF # 0.501147671692 666 + .quad 0x03FE00966D78C41CF + .quad 0x03FE00E5AE5B207AB # 0.501752327560 667 + .quad 0x03FE00E5AE5B207AB + .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668 + .quad 0x03FE011A8B18F0ED6 + .quad 0x03FE0169E072D7311 # 0.502760900515 669 + .quad 0x03FE0169E072D7311 + .quad 0x03FE01B942198A5A1 # 0.503366532915 670 + .quad 0x03FE01B942198A5A1 + .quad 0x03FE0208B010DB642 # 0.503972532327 671 + .quad 0x03FE0208B010DB642 + .quad 0x03FE02582A5C9D122 # 0.504578899198 672 + .quad 0x03FE02582A5C9D122 + .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673 + .quad 0x03FE02A7B100A3EF0 + .quad 0x03FE02F74400C64EA # 0.505792737097 674 + .quad 0x03FE02F74400C64EA + .quad 0x03FE0346E360DC4F9 # 0.506400209020 675 + .quad 0x03FE0346E360DC4F9 + .quad 0x03FE03968F24BFDB6 # 0.507008050190 676 + .quad 0x03FE03968F24BFDB6 + .quad 0x03FE03E647504CA89 # 0.507616261055 677 + .quad 0x03FE03E647504CA89 + .quad 0x03FE04360BE7603AE # 0.508224842066 678 + .quad 0x03FE04360BE7603AE + .quad 0x03FE046B4089BE0FD # 0.508630768599 679 + .quad 0x03FE046B4089BE0FD + .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680 + .quad 0x03FE04BB19DCA36B3 + .quad 0x03FE050AFFA5671A5 # 0.509849537793 681 + .quad 0x03FE050AFFA5671A5 + .quad 0x03FE055AF1E7ED47B # 0.510459479867 682 + .quad 0x03FE055AF1E7ED47B + .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683 + .quad 0x03FE05AAF0A81BF04 + .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684 + .quad 0x03FE05FAFBE9DAE58 + .quad 0x03FE064B13B113CDD # 0.512291541448 685 + .quad 0x03FE064B13B113CDD + .quad 0x03FE069B3801B2263 # 0.512902975280 686 + .quad 0x03FE069B3801B2263 + .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687 + .quad 0x03FE06D0AC85B63A2 + .quad 0x03FE0720E5C40DF1D # 0.513922863181 688 + .quad 0x03FE0720E5C40DF1D + .quad 0x03FE07712B9648153 # 0.514535295577 689 + .quad 0x03FE07712B9648153 + .quad 0x03FE07C17E0056E7C # 0.515148103277 690 + .quad 0x03FE07C17E0056E7C + .quad 0x03FE0811DD062E889 # 0.515761286740 691 + .quad 0x03FE0811DD062E889 + .quad 0x03FE086248ABC4F3B # 0.516374846428 692 + .quad 0x03FE086248ABC4F3B + .quad 0x03FE08B2C0F512033 # 0.516988782802 693 + .quad 0x03FE08B2C0F512033 + .quad 0x03FE08E86D82DA3EE # 0.517398283218 694 + .quad 0x03FE08E86D82DA3EE + .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695 + .quad 0x03FE0938FAE5D8E9B + .quad 0x03FE098994F72C539 # 0.518627791569 696 + .quad 0x03FE098994F72C539 + .quad 0x03FE09DA3BBAD339C # 0.519243113094 697 + .quad 0x03FE09DA3BBAD339C + .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698 + .quad 0x03FE0A2AEF34CE3D1 + .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699 + .quad 0x03FE0A7BAF691FE34 + .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700 + .quad 0x03FE0AB18BF5823C3 + .quad 0x03FE0B02616952989 # 0.521502536876 701 + .quad 0x03FE0B02616952989 + .quad 0x03FE0B5343A234476 # 0.522119630385 702 + .quad 0x03FE0B5343A234476 + .quad 0x03FE0BA432A430CA2 # 0.522737104934 703 + .quad 0x03FE0BA432A430CA2 + .quad 0x03FE0BF52E73538CE # 0.523354960993 704 + .quad 0x03FE0BF52E73538CE + .quad 0x03FE0C463713A9E6F # 0.523973199034 705 + .quad 0x03FE0C463713A9E6F + .quad 0x03FE0C7C43F4C861E # 0.524385570174 706 + .quad 0x03FE0C7C43F4C861E + .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707 + .quad 0x03FE0CCD61FAD07D2 + .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708 + .quad 0x03FE0D1E8CDCE3DB6 + .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709 + .quad 0x03FE0D6FC49F16E93 + .quad 0x03FE0DC109458004A # 0.526863374456 710 + .quad 
0x03FE0DC109458004A + .quad 0x03FE0DF73E353F0ED # 0.527276939392 711 + .quad 0x03FE0DF73E353F0ED + .quad 0x03FE0E4898611CCE1 # 0.527897607665 712 + .quad 0x03FE0E4898611CCE1 + .quad 0x03FE0E99FF7C20738 # 0.528518661406 713 + .quad 0x03FE0E99FF7C20738 + .quad 0x03FE0EEB738A67874 # 0.529140101094 714 + .quad 0x03FE0EEB738A67874 + .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715 + .quad 0x03FE0F21C81D1ADC3 + .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716 + .quad 0x03FE0F7351C9FCD7F + .quad 0x03FE0FC4E875254C1 # 0.530799164104 717 + .quad 0x03FE0FC4E875254C1 + .quad 0x03FE10168C22B8FB9 # 0.531422023047 718 + .quad 0x03FE10168C22B8FB9 + .quad 0x03FE10683CD6DEA54 # 0.532045270185 719 + .quad 0x03FE10683CD6DEA54 + .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720 + .quad 0x03FE109EB9E2E4C97 + .quad 0x03FE10F08055E7785 # 0.533084879385 721 + .quad 0x03FE10F08055E7785 + .quad 0x03FE114253DA97DA0 # 0.533709164079 722 + .quad 0x03FE114253DA97DA0 + .quad 0x03FE1194347523FDC # 0.534333838748 723 + .quad 0x03FE1194347523FDC + .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724 + .quad 0x03FE11CAD1789B0F8 + .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725 + .quad 0x03FE121CC7EB8F7E6 + .quad 0x03FE126ECB7F8F007 # 0.536001548120 726 + .quad 0x03FE126ECB7F8F007 + .quad 0x03FE12A57FDA37091 # 0.536418910396 727 + .quad 0x03FE12A57FDA37091 + .quad 0x03FE12F799594EFBC # 0.537045280601 728 + .quad 0x03FE12F799594EFBC + .quad 0x03FE1349C004AFB00 # 0.537672043392 729 + .quad 0x03FE1349C004AFB00 + .quad 0x03FE139BF3E094003 # 0.538299199261 730 + .quad 0x03FE139BF3E094003 + .quad 0x03FE13D2C873C5E13 # 0.538717521794 731 + .quad 0x03FE13D2C873C5E13 + .quad 0x03FE142512549C16C # 0.539345333889 732 + .quad 0x03FE142512549C16C + .quad 0x03FE14776971477F1 # 0.539973540381 733 + .quad 0x03FE14776971477F1 + .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734 + .quad 0x03FE14C9CDCE0A74D + .quad 0x03FE1500C2BFD1561 # 0.541021428981 735 + .quad 0x03FE1500C2BFD1561 + .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736 + .quad 0x03FE15533D3B8D7B3 + .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737 + .quad 0x03FE15A5C502C6DC5 + .quad 0x03FE15DCD1973457B # 0.542700338085 738 + .quad 0x03FE15DCD1973457B + .quad 0x03FE162F6F9071F76 # 0.543330656416 739 + .quad 0x03FE162F6F9071F76 + .quad 0x03FE16821AE0A13C6 # 0.543961372300 740 + .quad 0x03FE16821AE0A13C6 + .quad 0x03FE16B93F2C12808 # 0.544382070665 741 + .quad 0x03FE16B93F2C12808 + .quad 0x03FE170C00C169B51 # 0.545013450251 742 + .quad 0x03FE170C00C169B51 + .quad 0x03FE175ECFB935CC6 # 0.545645228728 743 + .quad 0x03FE175ECFB935CC6 + .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744 + .quad 0x03FE17B1AC17CBD5B + .quad 0x03FE17E8F12052E8A # 0.546699080654 745 + .quad 0x03FE17E8F12052E8A + .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746 + .quad 0x03FE183BE3DE8A7AF + .quad 0x03FE188EE40F23CA7 # 0.547965170715 747 + .quad 0x03FE188EE40F23CA7 + .quad 0x03FE18C640FF75F06 # 0.548387557205 748 + .quad 0x03FE18C640FF75F06 + .quad 0x03FE191957A30FA51 # 0.549021471648 749 + .quad 0x03FE191957A30FA51 + .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750 + .quad 0x03FE196C7BC4B1F3A + .quad 0x03FE19A3F0B1860BD # 0.550078889532 751 + .quad 0x03FE19A3F0B1860BD + .quad 0x03FE19F72B59A0CEC # 0.550713877383 752 + .quad 0x03FE19F72B59A0CEC + .quad 0x03FE1A4A738B7A33C # 0.551349268700 753 + .quad 0x03FE1A4A738B7A33C + .quad 0x03FE1A820089A2156 # 0.551773087312 754 + .quad 0x03FE1A820089A2156 + .quad 0x03FE1AD55F55855C8 # 0.552409152212 755 + .quad 0x03FE1AD55F55855C8 + .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756 + .quad 
0x03FE1B28CBB6EC93E + .quad 0x03FE1B6070DB553D8 # 0.553470160269 757 + .quad 0x03FE1B6070DB553D8 + .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758 + .quad 0x03FE1BB3F3EA714F6 + .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759 + .quad 0x03FE1BEBA8316EF2C + .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760 + .quad 0x03FE1C3F41FA97C6B + .quad 0x03FE1C92E96C86020 # 0.555808348176 761 + .quad 0x03FE1C92E96C86020 + .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762 + .quad 0x03FE1CCAB5FBFFEE1 + .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763 + .quad 0x03FE1D1E743BCFC47 + .quad 0x03FE1D72403052E75 # 0.557512288951 764 + .quad 0x03FE1D72403052E75 + .quad 0x03FE1DAA251D7E433 # 0.557938728190 765 + .quad 0x03FE1DAA251D7E433 + .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766 + .quad 0x03FE1DFE07F3D1DAB + .quad 0x03FE1E35FC265D75E # 0.559005622562 767 + .quad 0x03FE1E35FC265D75E + .quad 0x03FE1E89F5EB04126 # 0.559646305979 768 + .quad 0x03FE1E89F5EB04126 + .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769 + .quad 0x03FE1EDDFD77E1FEF + .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770 + .quad 0x03FE1F160A2AD0DA3 + .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771 + .quad 0x03FE1F6A28BA1B476 + .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772 + .quad 0x03FE1FBE551DB43C1 + .quad 0x03FE1FF67A6684F47 # 0.562427353873 773 + .quad 0x03FE1FF67A6684F47 + .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774 + .quad 0x03FE204ABDE0BE5DF + .quad 0x03FE2082F29233211 # 0.563499050471 775 + .quad 0x03FE2082F29233211 + .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776 + .quad 0x03FE20D74D2FBAFE4 + .quad 0x03FE210F91524B469 # 0.564571896835 777 + .quad 0x03FE210F91524B469 + .quad 0x03FE2164031FDA0B0 # 0.565216157568 778 + .quad 0x03FE2164031FDA0B0 + .quad 0x03FE21B882DD26040 # 0.565860833641 779 + .quad 0x03FE21B882DD26040 + .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780 + .quad 0x03FE21F0DFC65CEEC + .quad 0x03FE224576C81FFE0 # 0.566936218194 781 + .quad 0x03FE224576C81FFE0 + .quad 0x03FE227DE33896A44 # 0.567366696031 782 + .quad 0x03FE227DE33896A44 + .quad 0x03FE22D2918BA4A31 # 0.568012760445 783 + .quad 0x03FE22D2918BA4A31 + .quad 0x03FE23274DE272A83 # 0.568659242528 784 + .quad 0x03FE23274DE272A83 + .quad 0x03FE235FD33D232FC # 0.569090462888 785 + .quad 0x03FE235FD33D232FC + .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786 + .quad 0x03FE23B4A6F9D8688 + .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787 + .quad 0x03FE23ED3BF21CA33 + .quad 0x03FE24422721A89D7 # 0.570817206248 788 + .quad 0x03FE24422721A89D7 + .quad 0x03FE247ACBC023D2B # 0.571249358372 789 + .quad 0x03FE247ACBC023D2B + .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790 + .quad 0x03FE24CFCE6F80D9B + .quad 0x03FE250882BCDD7D8 # 0.572330556445 791 + .quad 0x03FE250882BCDD7D8 + .quad 0x03FE255D9CF910A56 # 0.572979836849 792 + .quad 0x03FE255D9CF910A56 + .quad 0x03FE25B2C55CD5762 # 0.573629539091 793 + .quad 0x03FE25B2C55CD5762 + .quad 0x03FE25EB92D41992D # 0.574062908546 794 + .quad 0x03FE25EB92D41992D + .quad 0x03FE2640D2D99FFEA # 0.574713315073 795 + .quad 0x03FE2640D2D99FFEA + .quad 0x03FE2679B0166F51C # 0.575147154559 796 + .quad 0x03FE2679B0166F51C + .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797 + .quad 0x03FE26CF07CAD8B00 + .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798 + .quad 0x03FE2707F4D5F7C40 + .quad 0x03FE275D644670606 # 0.576884397124 799 + .quad 0x03FE275D644670606 + .quad 0x03FE27966128AB11B # 0.577319179739 800 + .quad 0x03FE27966128AB11B + .quad 0x03FE27EBE8626A387 # 0.577971708311 801 + .quad 0x03FE27EBE8626A387 + .quad 0x03FE2824F52493BD2 # 0.578406964030 802 + .quad 
0x03FE2824F52493BD2 + .quad 0x03FE287A9434DBC7B # 0.579060203030 803 + .quad 0x03FE287A9434DBC7B + .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804 + .quad 0x03FE28B3B0DFCEB80 + .quad 0x03FE290967D3ED18D # 0.580149883861 805 + .quad 0x03FE290967D3ED18D + .quad 0x03FE294294708B773 # 0.580586088885 806 + .quad 0x03FE294294708B773 + .quad 0x03FE29986355D8C69 # 0.581240753393 807 + .quad 0x03FE29986355D8C69 + .quad 0x03FE29D19FED0C082 # 0.581677434622 808 + .quad 0x03FE29D19FED0C082 + .quad 0x03FE2A2786D0EC107 # 0.582332814220 809 + .quad 0x03FE2A2786D0EC107 + .quad 0x03FE2A60D36BA5253 # 0.582769972697 810 + .quad 0x03FE2A60D36BA5253 + .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811 + .quad 0x03FE2AB6D25B86EF7 + .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812 + .quad 0x03FE2AF02F02BE4AB + .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813 + .quad 0x03FE2B46460C1C2B3 + .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814 + .quad 0x03FE2B7FB2C8D1CC1 + .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815 + .quad 0x03FE2BD5E1F9316F2 + .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816 + .quad 0x03FE2C0F5ED46CE8D + .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817 + .quad 0x03FE2C65A6395F5F5 + .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818 + .quad 0x03FE2C9F333C2FE1E + .quad 0x03FE2CF592E351AE5 # 0.587811079263 819 + .quad 0x03FE2CF592E351AE5 + .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820 + .quad 0x03FE2D2F3016CE0EF + .quad 0x03FE2D85A80DC7324 # 0.588910342867 821 + .quad 0x03FE2D85A80DC7324 + .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822 + .quad 0x03FE2DBF557B0DF43 + .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823 + .quad 0x03FE2E15E5CF91FA7 + .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824 + .quad 0x03FE2E4FA37FC9577 + .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825 + .quad 0x03FE2E8967B3BF4E1 + .quad 0x03FE2EE01A3BED567 # 0.591553516212 826 + .quad 0x03FE2EE01A3BED567 + .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827 + .quad 0x03FE2F19EEBFB00BA + .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828 + .quad 0x03FE2F70B9C67A7C2 + .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829 + .quad 0x03FE2FAA9EA342D04 + .quad 0x03FE3001823684D73 # 0.593761510043 830 + .quad 0x03FE3001823684D73 + .quad 0x03FE303B7775937EF # 0.594203694441 831 + .quad 0x03FE303B7775937EF + .quad 0x03FE309273A3340FC # 0.594867337868 832 + .quad 0x03FE309273A3340FC + .quad 0x03FE30CC794DD19D0 # 0.595310011625 833 + .quad 0x03FE30CC794DD19D0 + .quad 0x03FE3106858C76BB7 # 0.595752881428 834 + .quad 0x03FE3106858C76BB7 + .quad 0x03FE315DA4434068B # 0.596417554101 835 + .quad 0x03FE315DA4434068B + .quad 0x03FE3197C0FA80E6A # 0.596860914783 836 + .quad 0x03FE3197C0FA80E6A + .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837 + .quad 0x03FE31EEF86D36EF1 + .quad 0x03FE322925A66E62D # 0.597970177237 838 + .quad 0x03FE322925A66E62D + .quad 0x03FE328075E32022F # 0.598636325813 839 + .quad 0x03FE328075E32022F + .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840 + .quad 0x03FE32BAB3A7B21E9 + .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841 + .quad 0x03FE32F4F80D0B1BD + .quad 0x03FE334C6B15D30DD # 0.600192400374 842 + .quad 0x03FE334C6B15D30DD + .quad 0x03FE3386C013B90D6 # 0.600637438209 843 + .quad 0x03FE3386C013B90D6 + .quad 0x03FE33DE4C086C40A # 0.601305366543 844 + .quad 0x03FE33DE4C086C40A + .quad 0x03FE3418B1A85622C # 0.601750900077 845 + .quad 0x03FE3418B1A85622C + .quad 0x03FE34531DF21CFE3 # 0.602196632199 846 + .quad 0x03FE34531DF21CFE3 + .quad 0x03FE34AACCE299BA5 # 0.602865603124 847 + .quad 0x03FE34AACCE299BA5 + .quad 0x03FE34E549DBB21EF # 0.603311832493 848 + .quad 
0x03FE34E549DBB21EF + .quad 0x03FE353D11DA4F855 # 0.603981550121 849 + .quad 0x03FE353D11DA4F855 + .quad 0x03FE35779F8C43D6D # 0.604428277847 850 + .quad 0x03FE35779F8C43D6D + .quad 0x03FE35B233F13DD4A # 0.604875205229 851 + .quad 0x03FE35B233F13DD4A + .quad 0x03FE360A1F1BBA738 # 0.605545971045 852 + .quad 0x03FE360A1F1BBA738 + .quad 0x03FE3644C446F97BC # 0.605993398346 853 + .quad 0x03FE3644C446F97BC + .quad 0x03FE367F702A9EA94 # 0.606441025927 854 + .quad 0x03FE367F702A9EA94 + .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855 + .quad 0x03FE36D77E9D34FD7 + .quad 0x03FE37123B54987B7 # 0.607560972287 856 + .quad 0x03FE37123B54987B7 + .quad 0x03FE376A630C0A1D6 # 0.608233542652 857 + .quad 0x03FE376A630C0A1D6 + .quad 0x03FE37A530A0D5A31 # 0.608682174333 858 + .quad 0x03FE37A530A0D5A31 + .quad 0x03FE37E004F74E13B # 0.609131007374 859 + .quad 0x03FE37E004F74E13B + .quad 0x03FE383850278CFD9 # 0.609804634884 860 + .quad 0x03FE383850278CFD9 + .quad 0x03FE3873356902AB7 # 0.610253972119 861 + .quad 0x03FE3873356902AB7 + .quad 0x03FE38AE2171976E8 # 0.610703511349 862 + .quad 0x03FE38AE2171976E8 + .quad 0x03FE390690373AFFF # 0.611378199331 863 + .quad 0x03FE390690373AFFF + .quad 0x03FE39418D3872A53 # 0.611828244343 864 + .quad 0x03FE39418D3872A53 + .quad 0x03FE397C91064221F # 0.612278491987 865 + .quad 0x03FE397C91064221F + .quad 0x03FE39D5237E045A5 # 0.612954243787 866 + .quad 0x03FE39D5237E045A5 + .quad 0x03FE3A1038522CE82 # 0.613404998809 867 + .quad 0x03FE3A1038522CE82 + .quad 0x03FE3A68E45AD354B # 0.614081512534 868 + .quad 0x03FE3A68E45AD354B + .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869 + .quad 0x03FE3AA40A3F2A68B + .quad 0x03FE3ADF36F98A182 # 0.614984243356 870 + .quad 0x03FE3ADF36F98A182 + .quad 0x03FE3B3806E5DF340 # 0.615661826668 871 + .quad 0x03FE3B3806E5DF340 + .quad 0x03FE3B7344BE40311 # 0.616113804077 872 + .quad 0x03FE3B7344BE40311 + .quad 0x03FE3BAE897234A87 # 0.616565985862 873 + .quad 0x03FE3BAE897234A87 + .quad 0x03FE3C077D5F51881 # 0.617244642149 874 + .quad 0x03FE3C077D5F51881 + .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875 + .quad 0x03FE3C42D33F2AE7B + .quad 0x03FE3C7E30002960C # 0.618150234241 876 + .quad 0x03FE3C7E30002960C + .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877 + .quad 0x03FE3CD7480B4A8A3 + .quad 0x03FE3D12B60622748 # 0.619283378838 878 + .quad 0x03FE3D12B60622748 + .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879 + .quad 0x03FE3D4E2AE7B7E2B + .quad 0x03FE3D89A6B1A558D # 0.620190819917 880 + .quad 0x03FE3D89A6B1A558D + .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881 + .quad 0x03FE3DE2ED57B1F9B + .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882 + .quad 0x03FE3E1E7A6D8330E + .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883 + .quad 0x03FE3E5A0E714DA6E + .quad 0x03FE3EB37978B85B6 # 0.622463031756 884 + .quad 0x03FE3EB37978B85B6 + .quad 0x03FE3EEF1ED68236B # 0.622918094335 885 + .quad 0x03FE3EEF1ED68236B + .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886 + .quad 0x03FE3F2ACB27ED6C7 + .quad 0x03FE3F845AAE68C81 # 0.624056657591 887 + .quad 0x03FE3F845AAE68C81 + .quad 0x03FE3FC0186800514 # 0.624512446113 888 + .quad 0x03FE3FC0186800514 + .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889 + .quad 0x03FE3FFBDD1AE8406 + .quad 0x03FE4037A8C8C197A # 0.625424646860 890 + .quad 0x03FE4037A8C8C197A + .quad 0x03FE409167679DD99 # 0.626109343909 891 + .quad 0x03FE409167679DD99 + .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892 + .quad 0x03FE40CD448FF6DD6 + .quad 0x03FE410928B8F950F # 0.627023003177 893 + .quad 0x03FE410928B8F950F + .quad 0x03FE41630C1B50AFF # 0.627708795866 894 + .quad 
0x03FE41630C1B50AFF + .quad 0x03FE419F01CD27AD0 # 0.628166252416 895 + .quad 0x03FE419F01CD27AD0 + .quad 0x03FE41DAFE85672B9 # 0.628623918328 896 + .quad 0x03FE41DAFE85672B9 + .quad 0x03FE42170245B4C6A # 0.629081793794 897 + .quad 0x03FE42170245B4C6A + .quad 0x03FE42711518DF546 # 0.629769000326 898 + .quad 0x03FE42711518DF546 + .quad 0x03FE42AD2A74888A0 # 0.630227400518 899 + .quad 0x03FE42AD2A74888A0 + .quad 0x03FE42E946DE080C0 # 0.630686010936 900 + .quad 0x03FE42E946DE080C0 + .quad 0x03FE43437EB9D9424 # 0.631374321162 901 + .quad 0x03FE43437EB9D9424 + .quad 0x03FE437FACCD31C10 # 0.631833457993 902 + .quad 0x03FE437FACCD31C10 + .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903 + .quad 0x03FE43BBE1F42FE09 + .quad 0x03FE43F81E307DE5E # 0.632752364559 904 + .quad 0x03FE43F81E307DE5E + .quad 0x03FE445285D68EA69 # 0.633442099038 905 + .quad 0x03FE445285D68EA69 + .quad 0x03FE448ED3CF71355 # 0.633902186463 906 + .quad 0x03FE448ED3CF71355 + .quad 0x03FE44CB28E37C3EE # 0.634362485666 907 + .quad 0x03FE44CB28E37C3EE + .quad 0x03FE450785145CAFE # 0.634822996841 908 + .quad 0x03FE450785145CAFE + .quad 0x03FE45621CB769366 # 0.635514161481 909 + .quad 0x03FE45621CB769366 + .quad 0x03FE459E8AB7B799D # 0.635975203444 910 + .quad 0x03FE459E8AB7B799D + .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911 + .quad 0x03FE45DAFFDABD4DB + .quad 0x03FE46177C2229EC0 # 0.636897925539 912 + .quad 0x03FE46177C2229EC0 + .quad 0x03FE467243F53F69E # 0.637590526283 913 + .quad 0x03FE467243F53F69E + .quad 0x03FE46AED21F117FC # 0.638052526753 914 + .quad 0x03FE46AED21F117FC + .quad 0x03FE46EB677335D13 # 0.638514740766 915 + .quad 0x03FE46EB677335D13 + .quad 0x03FE472803F35EAAE # 0.638977168520 916 + .quad 0x03FE472803F35EAAE + .quad 0x03FE4764A7A13EF3B # 0.639439810212 917 + .quad 0x03FE4764A7A13EF3B + .quad 0x03FE47BFAA9F80271 # 0.640134174319 918 + .quad 0x03FE47BFAA9F80271 + .quad 0x03FE47FC60471DAF8 # 0.640597351724 919 + .quad 0x03FE47FC60471DAF8 + .quad 0x03FE48391D226992D # 0.641060743762 920 + .quad 0x03FE48391D226992D + .quad 0x03FE4875E1331971E # 0.641524350631 921 + .quad 0x03FE4875E1331971E + .quad 0x03FE48D114D3FB884 # 0.642220164181 922 + .quad 0x03FE48D114D3FB884 + .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923 + .quad 0x03FE490DEAF1A3FC8 + .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924 + .quad 0x03FE494AC84AB0ED3 + .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925 + .quad 0x03FE4987ACE0DABB0 + .quad 0x03FE49C498B5DA63F # 0.644078037452 926 + .quad 0x03FE49C498B5DA63F + .quad 0x03FE4A20080EF10B2 # 0.644775630783 927 + .quad 0x03FE4A20080EF10B2 + .quad 0x03FE4A5D060894B8C # 0.645240963504 928 + .quad 0x03FE4A5D060894B8C + .quad 0x03FE4A9A0B471A943 # 0.645706512861 929 + .quad 0x03FE4A9A0B471A943 + .quad 0x03FE4AD717CC3E626 # 0.646172279055 930 + .quad 0x03FE4AD717CC3E626 + .quad 0x03FE4B142B99BC871 # 0.646638262288 931 + .quad 0x03FE4B142B99BC871 + .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932 + .quad 0x03FE4B6FD6F970C1F + .quad 0x03FE4BACFD036D080 # 0.647804171246 933 + .quad 0x03FE4BACFD036D080 + .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934 + .quad 0x03FE4BEA2A5BDBE87 + .quad 0x03FE4C275F047C956 # 0.648737878130 935 + .quad 0x03FE4C275F047C956 + .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936 + .quad 0x03FE4C649AFF0EE16 + .quad 0x03FE4CC082B46485A # 0.649906239052 937 + .quad 0x03FE4CC082B46485A + .quad 0x03FE4CFDD1037E37C # 0.650373965908 938 + .quad 0x03FE4CFDD1037E37C + .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939 + .quad 0x03FE4D3B26AAADDD9 + .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940 + .quad 
0x03FE4D7883ABB61F6 + .quad 0x03FE4DB5E8085A477 # 0.651778460521 941 + .quad 0x03FE4DB5E8085A477 + .quad 0x03FE4DF353C25E42B # 0.652247064091 942 + .quad 0x03FE4DF353C25E42B + .quad 0x03FE4E4F832C560DD # 0.652950381434 943 + .quad 0x03FE4E4F832C560DD + .quad 0x03FE4E8D015786F16 # 0.653419534621 944 + .quad 0x03FE4E8D015786F16 + .quad 0x03FE4ECA86E64A683 # 0.653888908016 945 + .quad 0x03FE4ECA86E64A683 + .quad 0x03FE4F0813DA673DD # 0.654358501826 946 + .quad 0x03FE4F0813DA673DD + .quad 0x03FE4F45A835A4E19 # 0.654828316258 947 + .quad 0x03FE4F45A835A4E19 + .quad 0x03FE4F8343F9CB678 # 0.655298351519 948 + .quad 0x03FE4F8343F9CB678 + .quad 0x03FE4FDFBB88A119A # 0.656003818920 949 + .quad 0x03FE4FDFBB88A119A + .quad 0x03FE501D69DADD660 # 0.656474407164 950 + .quad 0x03FE501D69DADD660 + .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951 + .quad 0x03FE505B1F9C43ED7 + .quad 0x03FE5098DCCE9FABA # 0.657416248534 952 + .quad 0x03FE5098DCCE9FABA + .quad 0x03FE50D6A173BC425 # 0.657887502077 953 + .quad 0x03FE50D6A173BC425 + .quad 0x03FE51146D8D65F98 # 0.658358977805 954 + .quad 0x03FE51146D8D65F98 + .quad 0x03FE5152411D69C03 # 0.658830675927 955 + .quad 0x03FE5152411D69C03 + .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956 + .quad 0x03FE51AF0C774A2D0 + .quad 0x03FE51ECF2B713F8A # 0.660010895584 957 + .quad 0x03FE51ECF2B713F8A + .quad 0x03FE522AE0738A3D8 # 0.660483373741 958 + .quad 0x03FE522AE0738A3D8 + .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959 + .quad 0x03FE5268D5AE7CDCB + .quad 0x03FE52A6D269BC600 # 0.661429000289 960 + .quad 0x03FE52A6D269BC600 + .quad 0x03FE52E4D6A719F9B # 0.661902149103 961 + .quad 0x03FE52E4D6A719F9B + .quad 0x03FE5322E26867857 # 0.662375521893 962 + .quad 0x03FE5322E26867857 + .quad 0x03FE53800225BA6E2 # 0.663086001497 963 + .quad 0x03FE53800225BA6E2 + .quad 0x03FE53BE20B8DA502 # 0.663559935155 964 + .quad 0x03FE53BE20B8DA502 + .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965 + .quad 0x03FE53FC46D64DDD1 + .quad 0x03FE543A747FE9ED6 # 0.664508476843 966 + .quad 0x03FE543A747FE9ED6 + .quad 0x03FE5478A9B78404C # 0.664983085300 967 + .quad 0x03FE5478A9B78404C + .quad 0x03FE54B6E67EF251C # 0.665457919117 968 + .quad 0x03FE54B6E67EF251C + .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969 + .quad 0x03FE54F52AD80BAE9 + .quad 0x03FE553376C4A7A16 # 0.666408263689 970 + .quad 0x03FE553376C4A7A16 + .quad 0x03FE5571CA469E5C9 # 0.666883774872 971 + .quad 0x03FE5571CA469E5C9 + .quad 0x03FE55CF55C5A5437 # 0.667597465874 972 + .quad 0x03FE55CF55C5A5437 + .quad 0x03FE560DBC45153C7 # 0.668073543008 973 + .quad 0x03FE560DBC45153C7 + .quad 0x03FE564C2A6059FE7 # 0.668549846899 974 + .quad 0x03FE564C2A6059FE7 + .quad 0x03FE568AA0194EC6E # 0.669026377763 975 + .quad 0x03FE568AA0194EC6E + .quad 0x03FE56C91D71CF810 # 0.669503135817 976 + .quad 0x03FE56C91D71CF810 + .quad 0x03FE5707A26BB8C66 # 0.669980121278 977 + .quad 0x03FE5707A26BB8C66 + .quad 0x03FE57462F08E7DF5 # 0.670457334363 978 + .quad 0x03FE57462F08E7DF5 + .quad 0x03FE5784C34B3AC30 # 0.670934775289 979 + .quad 0x03FE5784C34B3AC30 + .quad 0x03FE57C35F3490183 # 0.671412444273 980 + .quad 0x03FE57C35F3490183 + .quad 0x03FE580202C6C7353 # 0.671890341535 981 + .quad 0x03FE580202C6C7353 + .quad 0x03FE5840AE03C0204 # 0.672368467291 982 + .quad 0x03FE5840AE03C0204 + .quad 0x03FE589EBD437CA31 # 0.673086084831 983 + .quad 0x03FE589EBD437CA31 + .quad 0x03FE58DD7BB392B30 # 0.673564782782 984 + .quad 0x03FE58DD7BB392B30 + .quad 0x03FE591C41D500163 # 0.674043709994 985 + .quad 0x03FE591C41D500163 + .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986 + .quad 
0x03FE595B0FA9A7EF1 + .quad 0x03FE5999E5336E121 # 0.675002253082 987 + .quad 0x03FE5999E5336E121 + .quad 0x03FE59D8C2743705E # 0.675481869398 988 + .quad 0x03FE59D8C2743705E + .quad 0x03FE5A17A76DE803B # 0.675961715857 989 + .quad 0x03FE5A17A76DE803B + .quad 0x03FE5A56942266F7B # 0.676441792678 990 + .quad 0x03FE5A56942266F7B + .quad 0x03FE5A9588939A810 # 0.676922100084 991 + .quad 0x03FE5A9588939A810 + .quad 0x03FE5AD484C369F2D # 0.677402638296 992 + .quad 0x03FE5AD484C369F2D + .quad 0x03FE5B1388B3BD53E # 0.677883407536 993 + .quad 0x03FE5B1388B3BD53E + .quad 0x03FE5B5294667D5F7 # 0.678364408027 994 + .quad 0x03FE5B5294667D5F7 + .quad 0x03FE5B91A7DD93852 # 0.678845639990 995 + .quad 0x03FE5B91A7DD93852 + .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996 + .quad 0x03FE5BD0C31AE9E9D + .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997 + .quad 0x03FE5C2F7A8ED5E5B + .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998 + .quad 0x03FE5C6EA94431EF9 + .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999 + .quad 0x03FE5CADDFC6874F5 + .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000 + .quad 0x03FE5CED1E17C35C6 + .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001 + .quad 0x03FE5D2C6439D4252 + .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002 + .quad 0x03FE5D6BB22EA86F6 + .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003 + .quad 0x03FE5DAB07F82FB84 + .quad 0x03FE5DEA65985A350 # 0.683428931091 1004 + .quad 0x03FE5DEA65985A350 + .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005 + .quad 0x03FE5E29CB1118D32 + .quad 0x03FE5E6938645D390 # 0.684396517040 1006 + .quad 0x03FE5E6938645D390 + .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007 + .quad 0x03FE5EA8AD9419C5B + .quad 0x03FE5EE82AA241920 # 0.685365040118 1008 + .quad 0x03FE5EE82AA241920 + .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009 + .quad 0x03FE5F27AF90C8705 + .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010 + .quad 0x03FE5F673C61A2ED2 + .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011 + .quad 0x03FE5FA6D116C64F7 + .quad 0x03FE5FE66DB228992 # 0.687304904936 1012 + .quad 0x03FE5FE66DB228992 + .quad 0x03FE60261235C0874 # 0.687790459692 1013 + .quad 0x03FE60261235C0874 + .quad 0x03FE6065BEA385926 # 0.688276250325 1014 + .quad 0x03FE6065BEA385926 + .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015 + .quad 0x03FE60A572FD6FEF1 + .quad 0x03FE60E52F45788E4 # 0.689248540144 1016 + .quad 0x03FE60E52F45788E4 + .quad 0x03FE6124F37D991D4 # 0.689735039789 1017 + .quad 0x03FE6124F37D991D4 + .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018 + .quad 0x03FE6164BFA7CC06C + .quad 0x03FE61A493C60C729 # 0.690708749700 1019 + .quad 0x03FE61A493C60C729 + .quad 0x03FE61E46FDA56466 # 0.691195960429 1020 + .quad 0x03FE61E46FDA56466 + .quad 0x03FE622453E6A6263 # 0.691683408647 1021 + .quad 0x03FE622453E6A6263 + .quad 0x03FE62643FECF9743 # 0.692171094587 1022 + .quad 0x03FE62643FECF9743 + .quad 0x03FE62A433EF4E51A # 0.692659018480 1023 + .quad 0x03FE62A433EF4E51A + + +
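Note on the table that ends above: each .quad is a raw IEEE-754 bit pattern for a double, each value is stored twice (matching the 16-byte packed-constant layout used throughout this patch), and the trailing comment gives the decoded decimal value followed by the table index, here running up to 1023. Tables of this shape typically feed a table-plus-polynomial evaluation in the surrounding math kernels. As a quick sanity check, the small C program below (the helper name is ours, not part of the patch) bit-casts two of the patterns above and reproduces the commented values:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as an IEEE-754 double without
   violating strict aliasing. */
static double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

int main(void)
{
    /* Entry 297 and the final entry 1023, copied from the table above. */
    printf("%.12f\n", bits_to_double(0x3FD04D0EE20620AFULL)); /* 0.254703255393 */
    printf("%.12f\n", bits_to_double(0x3FE62A433EF4E51AULL)); /* 0.692659018480 */
    return 0;
}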
diff --git a/src/gas/vrdasin.S b/src/gas/vrdasin.S new file mode 100644 index 0000000..a5fb8d4 --- /dev/null +++ b/src/gas/vrdasin.S
@@ -0,0 +1,3073 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrdasin.s +# +# An array implementation of the sin libm function. +# +# Prototype: +# +# void vrda_sin(int n, double *x, double *y); +# +#Computes Sine of x for an array of input values. +#Places the results into the supplied y array. +#Does not perform error checking. +#Denormal inputs may produce unexpected results +#Author: Harsha Jagasia +#Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 16 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03fa5555555555555 + .quad 0x0bf56c16c16c16967 # -0.00138889 c2 + .quad 0x0bf56c16c16c16967 + .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3 + .quad 0x03efa01a019f4ec90 + .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4 + .quad 0x0be927e4fa17f65f6 + .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5 + .quad 0x03e21eeb69037ab78 + .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6 + .quad 
0x0bda907db46cc5e42 +.Lsinarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bfc5555555555555 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03f81111111110bb3 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0bf2a01a019e83e5c + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03ec71de3796cde01 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0be5ae600b42fdfa7 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x03de5e0b2f9a43bb8 +.Lsincosarray: + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x0bf56c16c16c16967 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x03efa01a019f4ec90 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x0be927e4fa17f65f6 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x03e21eeb69037ab78 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + .quad 0x0bda907db46cc5e42 +.Lcossinarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + .quad 0x0bfc5555555555555 # -0.166667 s1 + .quad 0x0bf56c16c16c16967 + .quad 0x03f81111111110bb3 # 0.00833333 s2 + .quad 0x03efa01a019f4ec90 + .quad 0x0bf2a01a019e83e5c # -0.000198413 s3 + .quad 0x0be927e4fa17f65f6 + .quad 0x03ec71de3796cde01 # 2.75573e-006 s4 + .quad 0x03e21eeb69037ab78 + .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5 + .quad 0x0bda907db46cc5e42 + .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6 + +.Levensin_oddcos_tbl: + .quad .Lsinsin_sinsin_piby4 # 0 + .quad .Lsinsin_sincos_piby4 # 1 + .quad .Lsinsin_cossin_piby4 # 2 + .quad .Lsinsin_coscos_piby4 # 3 + + .quad .Lsincos_sinsin_piby4 # 4 + .quad .Lsincos_sincos_piby4 # 5 + .quad .Lsincos_cossin_piby4 # 6 + .quad .Lsincos_coscos_piby4 # 7 + + .quad .Lcossin_sinsin_piby4 # 8 + .quad .Lcossin_sincos_piby4 # 9 + .quad .Lcossin_cossin_piby4 # 10 + .quad .Lcossin_coscos_piby4 # 11 + + .quad .Lcoscos_sinsin_piby4 # 12 + .quad .Lcoscos_sincos_piby4 # 13 + .quad .Lcoscos_cossin_piby4 # 14 + .quad .Lcoscos_coscos_piby4 # 15 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .weak vrda_sin_ + .set vrda_sin_,__vrda_sin__ + .weak vrda_sin__ + .set vrda_sin__,__vrda_sin__ + + .text + .align 16 + .p2align 4,,15 + +#x/* a FORTRAN subroutine implementation of array sin +#** VRDA_SIN(N,X,Y) +# C equivalent*/ +#void vrda_sin__(int * n, double *x, double *y) +#{ +# vrda_sin(*n,x,y); +#} +.globl __vrda_sin__ + .type __vrda_sin__,@function +__vrda_sin__: + mov (%rdi),%edi + + .align 16 + .p2align 4,,15 + +# define local variable storage offsets +.equ p_temp, 0x00 # temporary for get/put bits operation +.equ p_temp1, 0x10 # temporary for get/put bits operation + +.equ save_xmm6, 0x20 # temporary for get/put bits operation +.equ save_xmm7, 0x30 # temporary for get/put bits operation +.equ save_xmm8, 0x40 # temporary for get/put bits operation +.equ save_xmm9, 0x50 # temporary for get/put bits operation +.equ save_xmm10, 0x60 # temporary for get/put bits operation +.equ save_xmm11, 0x70 # temporary for get/put bits operation +.equ save_xmm12, 0x80 # temporary for get/put bits operation +.equ save_xmm13, 0x90 # temporary for get/put bits operation +.equ save_xmm14, 0x0A0 # temporary for get/put bits operation +.equ save_xmm15, 0x0B0 # temporary for get/put bits operation + +.equ r, 0x0C0 # pointer to r for remainder_piby2 +.equ rr, 0x0D0 # pointer to r for remainder_piby2 +.equ region, 0x0E0 # pointer to r for remainder_piby2 + +.equ r1, 0x0F0 # pointer to r for remainder_piby2 +.equ rr1, 0x0100 # pointer to r for 
remainder_piby2 +.equ region1, 0x0110 # pointer to r for remainder_piby2 + +.equ p_temp2, 0x0120 # temporary for get/put bits operation +.equ p_temp3, 0x0130 # temporary for get/put bits operation + +.equ p_temp4, 0x0140 # temporary for get/put bits operation +.equ p_temp5, 0x0150 # temporary for get/put bits operation + +.equ p_original, 0x0160 # original x +.equ p_mask, 0x0170 # original x +.equ p_sign, 0x0180 # original x + +.equ p_original1, 0x0190 # original x +.equ p_mask1, 0x01A0 # original x +.equ p_sign1, 0x01B0 # original x + +.equ save_r12, 0x01C0 # temporary for get/put bits operation +.equ save_r13, 0x01D0 # temporary for get/put bits operation + +.equ save_xa, 0x01E0 #qword +.equ save_ya, 0x01F0 #qword + +.equ save_nv, 0x0200 #qword +.equ p_iter, 0x0210 # qword storage for number of loop iterations + + +.globl vrda_sin + .type vrda_sin,@function +vrda_sin: + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# parameters are passed in on Linux (System V AMD64 ABI) as: +# rdi - int n +# rsi - double *x +# rdx - double *y + + sub $0x228,%rsp + mov %r12,save_r12(%rsp) # save r12 + mov %r13,save_r13(%rsp) # save r13 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#START PROCESS INPUT + +# save the arguments + mov %rsi, save_xa(%rsp) # save x_array pointer + mov %rdx, save_ya(%rsp) # save y_array pointer +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax + mov %rax,%rdi +#endif + mov %rdi,save_nv(%rsp) # save number of values + +# see if too few values to call the main loop + shr $2,%rax # get number of iterations + jz .L__vrda_cleanup # jump if only single calls + +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rdi # compute number of extra single calls + mov %rdi,save_nv(%rsp) # save number of left over values + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#START LOOP +.align 16 +.L__vrda_top: + +# build the input __m128d + movapd .L__real_7fffffffffffffff(%rip),%xmm2 + + mov save_xa(%rsp),%rsi # get x_array pointer + movlpd (%rsi),%xmm0 + movhpd 8(%rsi),%xmm0 + mov (%rsi),%rax + mov 8(%rsi),%rcx + movdqa %xmm0,%xmm6 + + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + movlpd -16(%rsi), %xmm1 + movhpd -8(%rsi), %xmm1 + mov -16(%rsi), %r8 + mov -8(%rsi), %r9 + movdqa %xmm1,%xmm7 + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + +andpd %xmm2,%xmm0 #Unsign +andpd %xmm2,%xmm1 #Unsign + +and .L__real_7fffffffffffffff(%rip), %rax +and .L__real_7fffffffffffffff(%rip), %rcx +and .L__real_7fffffffffffffff(%rip), %r8 +and .L__real_7fffffffffffffff(%rip), %r9 + +movdqa %xmm0,%xmm12 +movdqa %xmm1,%xmm13 + +pcmpgtd %xmm6,%xmm12 +pcmpgtd %xmm7,%xmm13 +movdqa %xmm12,%xmm6 +movdqa %xmm13,%xmm7 +psrldq $4,%xmm12 +psrldq $4,%xmm13 +psrldq $8,%xmm6 +psrldq $8,%xmm7 + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + + +por %xmm6,%xmm12 +por %xmm7,%xmm13 +movd %xmm12,%r12 #Move Sign to gpr ** +movd %xmm13,%r13 #Move Sign to gpr ** + +movapd %xmm0,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm0,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp
%r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm0 + mulpd %xmm0,%xmm2 # * twobypi + mulpd %xmm0,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. + +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + mov %r8,%r10 + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + +# GET_BITS_DP64(rhead-rtail, uy); ; 
originally only rhead +# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm0,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm0 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + subpd %xmm0,%xmm6 #rr=rhead-r + subpd %xmm1,%xmm7 #rr=rhead-r + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm0,%xmm2 # move r for r2 + movapd %xmm1,%xmm3 # move r for r2 + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail + subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail + + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + +#DEBUG +# jmp .Lfinal_check +#DEBUG + + leaq .Levensin_oddcos_tbl(%rip),%rsi + jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm10, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm10 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm0 + subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm6 # 
rr=rhead-r + subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm0,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf + + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrd4_sin_lower_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) # rr = 0 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + call __amd_remainder_piby2@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf +# mov p_original(r%sp),%rax + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) #rr = 0 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr+8(%rsp),%rsi + lea r+8(%rsp),%rdi + movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrd4_sin_upper_naninf_of_both_gt_5e5: +# mov p_original+8(%rsp),%rcx ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr+8(%rsp) #rr = 0 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: 
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm10,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%eax # eax = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm6,rr(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+# mov p_original+8(%rsp),%rcx ; upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5
+
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd 
.L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. + +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm5,region1(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm1,%xmm7 # rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + + subpd %xmm1,%xmm7 # rr=rhead-r + subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail + movapd %xmm7,rr1(%rsp) + + jmp .L__vrd4_sin_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm10, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
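+# The reduction below is the standard Cody-Waite scheme: pi/2 is split into
+# three parts (piby2_1, piby2_2, piby2_2tail) so the reduced argument is
+# carried as a head/tail pair.  As a hedged C sketch of what the following
+# instructions compute (variable names are illustrative only):
+#
+#    int    npi2  = (int)(x * twobypi + 0.5);   /* nearest multiple of pi/2  */
+#    double rhead = x - npi2 * piby2_1;         /* leading bits of remainder */
+#    double rtail = npi2 * piby2_2;             /* next chunk of npi2*pi/2   */
+#    double t     = rhead;
+#    rhead = t - rtail;
+#    rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#    r  = rhead - rtail;                        /* reduced argument          */
+#    rr = (rhead - r) - rtail;                  /* correction word           */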
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm4,region(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm0,%xmm6 # rhead + subpd %xmm8,%xmm0 # r = rhead - rtail + movapd %xmm0,r(%rsp) + + subpd %xmm0,%xmm6 # rr=rhead-r + subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail + movapd %xmm6,rr(%rsp) + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + movsd %xmm7,%xmm1 + subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail) + subsd %xmm1,%xmm7 # rr=rhead-r + subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail) + movlpd %xmm1,r1+8(%rsp) # store upper r + movlpd %xmm7,rr1+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is 
**NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_sin_lower_naninf_higher: +# mov p_original1(%rsp),%r8 ; upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) # rr = 0 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrd4_sin_reconstruct + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movsd %xmm1,%xmm0 + call __amd_remainder_piby2@PLT + mov p_temp1(%rsp),%r9 #Restore upper arg + jmp 0f + +.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf +# mov p_original1(%rsp),%r8 + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) #rr = 0 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr1+8(%rsp),%rsi + lea r1+8(%rsp),%rdi + movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher: +# mov p_original1+8(%rsp),%r9 ;upper arg is nan/inf +# movd %xmm6,%r9 ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) #rr = 0 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + jmp .L__vrd4_sin_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
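+# Judging only from the call sites above, __amd_remainder_piby2 takes the
+# argument in xmm0 and three result pointers in rdi/rsi/rdx; a C prototype
+# consistent with that usage (an inference, not taken from a header) is:
+#
+#    void __amd_remainder_piby2(double x, double *r, double *rr, int *region);
+#
+# It is reached only when |x| >= 5e5 (0x411E848000000000), where the in-line
+# three-part pi/2 split would no longer leave enough correction bits.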
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm1,%xmm1 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm1,r1(%rsp) # store lower r
+ movlpd %xmm7,rr1(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+ 
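+# The naninf blocks, like the one below, apply the usual special-case rule:
+# when the exponent field of the input is all ones (NaN or Inf) the stored r
+# is the input with its quiet bit forced on, so sin(NaN) and sin(Inf) both
+# come back as a quiet NaN.  Hedged C sketch of the test and fixup on the
+# raw bits ux:
+#
+#    if ((ux & 0x7ff0000000000000UL) == 0x7ff0000000000000UL) {
+#        r      = ux | 0x0008000000000000UL; /* quiet NaN, also NaN from Inf */
+#        rr     = 0;
+#        region = 0;
+#    }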
+.L__vrd4_sin_upper_naninf_higher: +# mov p_original1+8(%rsp),%r9 ; upper arg is nan/inf +# mov r1+8(%rsp),%r9 ; upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) # rr = 0 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrd4_sin_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_sin_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd r(%rsp),%xmm0 + movapd r1(%rsp),%xmm1 + + movapd rr(%rsp),%xmm6 + movapd rr1(%rsp),%xmm7 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + + mov %r8,%r10 + mov %r9,%r11 + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm0,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + leaq .Levensin_oddcos_tbl(%rip),%rsi + jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + + + + + + + + + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_sin_cleanup: + + movapd p_sign(%rsp), %xmm0 + movapd p_sign1(%rsp), %xmm1 + xorpd %xmm4, %xmm0 # (+) Sign + xorpd %xmm5, %xmm1 # (+) Sign + +.L__vrda_bottom1: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlpd %xmm0,(%rdi) + movhpd %xmm0,8(%rdi) + +.L__vrda_bottom2: + prefetch 64(%rdi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movlpd %xmm1, -16(%rdi) + movhpd %xmm1, -8(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vrda_top + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vrda_cleanup + 
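+# When n is not a multiple of four, the leftover one to three values are
+# zero-padded into a scratch block and pushed through one more 4-wide call
+# (.L__vrda_cleanup below).  A hedged C sketch of the idea; pad and out are
+# illustrative names for the p_temp/p_temp2 scratch areas:
+#
+#    double pad[4] = { 0.0, 0.0, 0.0, 0.0 }, out[4];
+#    for (int i = 0; i < leftover; i++)      /* leftover = n % 4, 1..3      */
+#        pad[i] = x[n - leftover + i];
+#    vrda_sin(4, pad, out);                  /* recursive 4-wide evaluation */
+#    for (int i = 0; i < leftover; i++)
+#        y[n - leftover + i] = out[i];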
+.L__final_check:
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x228,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when we have an odd number of sin calls to make at the end
+# The number of values left is in save_nv
+
+.align 16
+.L__vrda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an _m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_temp+8(%rsp)
+ movapd %xmm0,p_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L__vrdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_temp+16(%rsp)
+
+.L__vrdacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrda_sin@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p_temp2(%rsp),%rcx
+ mov %rcx, (%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+8(%rsp),%rcx
+ mov %rcx, 8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+16(%rsp),%rcx
+ mov %rcx, 16(%rdi) # do the third value
+
+.L__vrdacgf:
+ jmp .L__final_check
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+
+
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ 
subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm2 # x4 recalculate + mulpd %xmm3,%xmm3 # x4 recalculate + + movapd p_temp2(%rsp),%xmm12 # r + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + + subpd %xmm12,%xmm4 # + t + subpd %xmm13,%xmm5 # + t + + jmp .L__vrd4_sin_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcossin_cossin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # s3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps 
%xmm10,%xmm10 # move high x4 for cos term + movhlps %xmm11,%xmm11 # move high x4 for cos term + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd %xmm0,%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos + movhlps %xmm3,%xmm13 # move high r for cos + movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos + movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm7,%xmm5 # sin *x3 + mulsd %xmm10,%xmm8 # cos *x4 + mulsd %xmm11,%xmm9 # cos *x4 + + mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term + mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term + movsd %xmm12,%xmm6 # Keep high r for cos term + movsd %xmm13,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0 + + subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx + subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x*xx for cos term + movhlps %xmm1,%xmm11 # move high x for x*xx for cos term + + mulsd p_temp+8(%rsp),%xmm10 # x * xx + mulsd p_temp1+8(%rsp),%xmm11 # x * xx + + movsd %xmm12,%xmm2 # move -t for cos term + movsd %xmm13,%xmm3 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm12 #1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm13 #1+(-t) + addsd p_temp(%rsp),%xmm4 # sin+xx + addsd p_temp1(%rsp),%xmm5 # sin+xx + subsd %xmm6,%xmm12 # (1-t) - r + subsd %xmm7,%xmm13 # (1-t) - r + subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx + subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx + addsd %xmm0,%xmm4 # sin + x + addsd %xmm1,%xmm5 # sin + x + addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx) + addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx) + subsd %xmm2,%xmm8 # cos+t + subsd %xmm3,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_sin_cleanup + +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lsincos_cossin_piby4: # changed from sincos_sincos + # xmm1 is cossin and xmm0 is sincos + + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term + + movapd .Lsincosarray+0x50(%rip),%xmm4 # s6 + movapd .Lcossinarray+0x50(%rip),%xmm5 # s6 + movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3 + movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # 
x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm10,%xmm10 # move high x4 for cos term + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin) + movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos) + + mulsd %xmm0,%xmm6 # get low x3 for sin term + mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos (cossin) + movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos) + + movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin) + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos) + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm11,%xmm5 # cos *x4 + mulsd %xmm10,%xmm8 # cos *x4 + mulsd %xmm7,%xmm9 # sin *x3 + + mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin) + mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos) + + movsd %xmm12,%xmm6 # Keep high r for cos term + movsd %xmm3,%xmm7 # Keep low r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin) + subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos) + + movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin) + movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos) + + mulsd p_temp+8(%rsp),%xmm10 # x * xx + mulsd p_temp1(%rsp),%xmm1 # x * xx + + movsd %xmm12,%xmm2 # move -t for cos term + movsd %xmm3,%xmm13 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t) + + addsd p_temp(%rsp),%xmm4 # sin+xx + + addsd p_temp1+8(%rsp),%xmm9 # sin+xx + + + subsd %xmm6,%xmm12 # (1-t) - r + subsd %xmm7,%xmm3 # (1-t) - r + + subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx + subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx + + addsd %xmm0,%xmm4 # sin + x + + addsd %xmm11,%xmm9 # sin + x + + + addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx) + addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx) + + subsd %xmm2,%xmm8 # cos+t + subsd %xmm13,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lsincos_sincos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm0,p_temp2(%rsp) # Store r + movapd %xmm1,p_temp3(%rsp) # Store r + + + movapd .Lcossinarray+0x50(%rip),%xmm4 # s6 + movapd .Lcossinarray+0x50(%rip),%xmm5 # s6 + movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3 + movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + 
mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term + movhlps %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term + mulsd p_temp3+8(%rsp),%xmm7 # get low x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term + movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term + # Reverse 12 and 2 + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos + + mulsd %xmm6,%xmm8 # sin *x3 + mulsd %xmm7,%xmm9 # sin *x3 + mulsd %xmm10,%xmm4 # cos *x4 + mulsd %xmm11,%xmm5 # cos *x4 + + mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term + mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term + movsd %xmm2,%xmm6 # Keep high r for cos term + movsd %xmm3,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0 + + subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx + subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x for sin term + # Reverse 10 and 0 + + mulsd p_temp(%rsp),%xmm0 # x * xx + mulsd p_temp1(%rsp),%xmm1 # x * xx + + movsd %xmm2,%xmm12 # move -t for cos term + movsd %xmm3,%xmm13 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t) + addsd p_temp+8(%rsp),%xmm8 # sin+xx + addsd p_temp1+8(%rsp),%xmm9 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm7,%xmm3 # (1-t) - r + + subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx + subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm8 # sin + x + addsd %xmm11,%xmm9 # sin + x + + addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx) + addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx) + + subsd %xmm12,%xmm4 # cos+t + subsd %xmm13,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lcossin_sincos_piby4: # changed from sincos_sincos + # xmm1 is cossin and xmm0 is sincos +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + movapd %xmm6,p_temp(%rsp) # Store rr + movapd %xmm7,p_temp1(%rsp) # Store rr + movapd %xmm0,p_temp2(%rsp) # Store r + + + movapd .Lcossinarray+0x50(%rip),%xmm4 # s6 + movapd 
.Lsincosarray+0x50(%rip),%xmm5 # s6 + movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3 + movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3 + + movapd %xmm2,%xmm10 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3 + + movapd %xmm2,%xmm12 # move x2 for x6 + movapd %xmm3,%xmm13 # move x2 for x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2s3) + mulpd %xmm3,%xmm9 # x2(s2+x2s3) + + mulpd %xmm10,%xmm12 # x6 + mulpd %xmm11,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6) + addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3) + addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3) + + movhlps %xmm11,%xmm11 # move high x4 for cos term + + + mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6)) + mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6)) + + movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + + mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term + movhlps %xmm3,%xmm13 # move high r for cos + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin + + mulsd %xmm6,%xmm8 # sin *x3 + mulsd %xmm11,%xmm9 # cos *x4 + mulsd %xmm10,%xmm4 # cos *x4 + mulsd %xmm7,%xmm5 # sin *x3 + + mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term + mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term + + movsd %xmm2,%xmm6 # Keep high r for cos term + movsd %xmm13,%xmm7 # Keep high r for cos term + + subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0 + + subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx + subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx + + movhlps %xmm0,%xmm10 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x*xx for cos term + + mulsd p_temp(%rsp),%xmm0 # x * xx + mulsd p_temp1+8(%rsp),%xmm11 # x * xx + + movsd %xmm2,%xmm12 # move -t for cos term + movsd %xmm13,%xmm3 # move -t for cos term + + addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t) + addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t) + + addsd p_temp+8(%rsp),%xmm8 # sin+xx + addsd p_temp1(%rsp),%xmm5 # sin+xx + + subsd %xmm6,%xmm2 # (1-t) - r + subsd %xmm7,%xmm13 # (1-t) - r + + subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx + subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx + + + addsd %xmm10,%xmm8 # sin + x + addsd %xmm1,%xmm5 # sin + x + + addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx) + addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx) + + subsd %xmm12,%xmm4 # cos+t + subsd %xmm3,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + jmp .L__vrd4_sin_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_sinsin_piby4: + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd 
.Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,p_temp2(%rsp) # store x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + movapd %xmm11,p_temp3(%rsp) # store r + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + mulpd %xmm2,%xmm10 # x4 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm0,%xmm2 # x3 recalculate + mulpd %xmm3,%xmm3 # x4 recalculate + + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm6,%xmm12 # 0.5 * x2 *xx + mulpd %xmm1,%xmm7 # x * xx + + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm12,%xmm4 # -0.5 * x2 *xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm6,%xmm4 # x3 * zs +xx + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + addpd %xmm0,%xmm4 # +x + subpd %xmm13,%xmm5 # + t + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lsinsin_coscos_piby4: + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + movapd %xmm3,p_temp3(%rsp) # store x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + movapd %xmm10,p_temp2(%rsp) # store r + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + mulpd %xmm3,%xmm11 # x4 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + 
addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm2 # x4 recalculate + mulpd %xmm1,%xmm3 # x3 recalculate + + movapd p_temp2(%rsp),%xmm12 # r + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm7,%xmm13 # 0.5 * x2 *xx + subpd %xmm12,%xmm10 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x3 * zs + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;; + subpd %xmm13,%xmm5 # -0.5 * x2 *xx + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm7,%xmm5 # +xx + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + addpd %xmm1,%xmm5 # +x + subpd %xmm12,%xmm4 # + t + + jmp .L__vrd4_sin_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_cossin_piby4: #Derive from cossin_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + movhlps %xmm10,%xmm10 # get upper r for t for cos + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + movsd %xmm0,%xmm8 # lower x for sin + mulsd %xmm2,%xmm8 # lower x3 for sin + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # upper x4 for cos + movsd %xmm8,%xmm2 # lower x3 for sin + + movsd %xmm6,%xmm9 # lower xx + # note using odd reg + + movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term + movapd p_temp3(%rsp),%xmm13 # r + + mulpd %xmm0,%xmm6 # x * xx for upper cos term + mulpd %xmm1,%xmm7 # x * xx + movhlps %xmm6,%xmm6 + mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + # x3 * zs + + movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin + + subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp(%rsp),%xmm4 # +xx + + + subsd 
.L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subsd %xmm12,%xmm8 # + t + addsd %xmm0,%xmm4 # +x + subpd %xmm13,%xmm5 # + t + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lcoscos_sincos_piby4: #Derive from sincos_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm5 # c6 + movapd .Lcossinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zszc + addpd %xmm9,%xmm5 # z + + mulpd %xmm0,%xmm2 # upper x3 for sin + mulsd %xmm0,%xmm2 # lower x4 for cos + mulpd %xmm3,%xmm3 # x4 + + movhlps %xmm6,%xmm9 # upper xx for sin term + # note using odd reg + + movlpd p_temp2(%rsp),%xmm12 # lower r for cos term + movapd p_temp3(%rsp),%xmm13 # r + + + mulpd %xmm0,%xmm6 # x * xx for lower cos term + mulpd %xmm1,%xmm7 # x * xx + + mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + subpd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + mulpd %xmm3,%xmm5 + # x4 * zc + + movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos + subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx + + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp+8(%rsp),%xmm8 # +xx + + movhlps %xmm0,%xmm0 # upper x for sin + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subsd %xmm12,%xmm4 # + t + subpd %xmm13,%xmm5 # + t + addsd %xmm0,%xmm8 # +x + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lcossin_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + 
mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + movhlps %xmm11,%xmm11 # get upper r for t for cos + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zcs + + movsd %xmm1,%xmm9 # lower x for sin + mulsd %xmm3,%xmm9 # lower x3 for sin + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # upper x4 for cos + movsd %xmm9,%xmm3 # lower x3 for sin + + movsd %xmm7,%xmm8 # lower xx + # note using even reg + + movapd p_temp2(%rsp),%xmm12 # r + movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx for upper cos term + movhlps %xmm7,%xmm7 + mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + # x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin + + subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1(%rsp),%xmm5 # +xx + + + subpd .L__real_3ff0000000000000(%rip),%xmm12 # t relaculate, -t = r-1 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # t relaculate, -t = r-1 + + subpd %xmm12,%xmm4 # + t + subsd %xmm13,%xmm9 # + t + addsd %xmm1,%xmm5 # +x + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lcossin_sinsin_piby4: # Derived from sincos_sinsin + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsincosarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + movhlps %xmm11,%xmm11 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd 
.Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm6,%xmm10 # 0.5*x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zczs + + movsd %xmm3,%xmm12 + mulsd %xmm1,%xmm12 # low x3 for sin + + mulpd %xmm0, %xmm2 # x3 + mulpd %xmm3, %xmm3 # high x4 for cos + movsd %xmm12,%xmm3 # low x3 for sin + + movhlps %xmm1,%xmm8 # upper x for cos term + # note using even reg + movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term + + mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term + + mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term + + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin + + subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx + + subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx + + subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx + addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1(%rsp),%xmm5 # +xx + + addpd %xmm6,%xmm4 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + + addsd %xmm1,%xmm5 # +x + addpd %xmm0,%xmm4 # +x + subsd %xmm13,%xmm9 # + t + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lsincos_coscos_piby4: + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcosarray+0x20(%rip),%xmm8 # c3 + movapd .Lcossinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3 + + addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # 
x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm7,%xmm8 # upper xx for sin term + # note using even reg + + movapd p_temp2(%rsp),%xmm12 # r + movlpd p_temp3(%rsp),%xmm13 # lower r for cos term + + mulpd %xmm0,%xmm6 # x * xx + mulpd %xmm1,%xmm7 # x * xx for lower cos term + + mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term + + subpd %xmm12,%xmm10 # (1 + (-t)) - r + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos + + subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx + + subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1+8(%rsp),%xmm9 # +xx + + movhlps %xmm1,%xmm1 # upper x for sin + subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + subpd %xmm12,%xmm4 # + t + subsd %xmm13,%xmm5 # + t + addsd %xmm1, %xmm9 # +x + + movlhps %xmm9, %xmm5 + + jmp .L__vrd4_sin_cleanup + + +.align 16 +.Lsincos_sinsin_piby4: # Derived from sincos_coscos + movapd %xmm2,%xmm10 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lcossinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm11,p_temp3(%rsp) # r + movapd %xmm7,p_temp1(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm6,%xmm10 # 0.5x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm0,%xmm2 # x3 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm7,%xmm8 # upper xx for sin term + # note using even reg + + movlpd p_temp3(%rsp),%xmm13 # lower r for cos term + + mulpd %xmm1,%xmm7 # x * xx for lower cos term + + mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term + + subsd %xmm13,%xmm11 # (1 + (-t)) - r + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos + + subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx + + 
subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx + + subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx + addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp1+8(%rsp),%xmm9 # +xx + + movhlps %xmm1,%xmm1 # upper x for sin + addpd %xmm6,%xmm4 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1 + + addsd %xmm1,%xmm9 # +x + addpd %xmm0,%xmm4 # +x + subsd %xmm13,%xmm5 # + t + + movlhps %xmm9,%xmm5 + + jmp .L__vrd4_sin_cleanup + + +.align 16 +.Lsinsin_cossin_piby4: # Derived from sincos_sinsin + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsincosarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # x2 + movapd %xmm6,p_temp(%rsp) # xx + + movhlps %xmm10,%xmm10 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm7,%xmm11 # 0.5*x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zs + + + movsd %xmm2,%xmm13 + mulsd %xmm0,%xmm13 # low x3 for sin + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm2,%xmm2 # high x4 for cos + movsd %xmm13,%xmm2 # low x3 for sin + + + movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg + movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term + mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term + mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term + subsd %xmm12,%xmm10 # (1 + (-t)) - r + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin + subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx + + subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx + + subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx + + addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp(%rsp),%xmm4 # +xx + + addpd %xmm7,%xmm5 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + + addsd %xmm0,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + subsd %xmm12,%xmm8 # + t + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_sin_cleanup + +.align 16 +.Lsinsin_sincos_piby4: # Derived from sincos_coscos + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lcossinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 
r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + + movapd %xmm10,p_temp2(%rsp) # r + movapd %xmm6,p_temp(%rsp) # rr + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos + + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + + addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm7,%xmm11 # 0.5x2*xx + addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos + + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd %xmm2,%xmm12 # x6 + mulpd %xmm3,%xmm13 # x6 + + addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm0,%xmm2 # upper x3 for sin + mulsd %xmm0,%xmm2 # lower x4 for cos + + movhlps %xmm6,%xmm9 # upper xx for sin term + # note using even reg + + movlpd p_temp2(%rsp),%xmm12 # lower r for cos term + + mulpd %xmm0,%xmm6 # x * xx for lower cos term + + mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term + + subsd %xmm12,%xmm10 # (1 + (-t)) - r + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + movhlps %xmm4,%xmm8 # xmm9= sin, xmm5= cos + + subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx + + subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx + + subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx + addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx) + addsd p_temp+8(%rsp),%xmm8 # +xx + + movhlps %xmm0,%xmm0 # upper x for sin + addpd %xmm7,%xmm5 # +xx + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1 + + + addsd %xmm0,%xmm8 # +x + addpd %xmm1,%xmm5 # +x + subsd %xmm12,%xmm4 # + t + + movlhps %xmm8,%xmm4 + + jmp .L__vrd4_sin_cleanup + + +.align 16 +.Lsinsin_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +#DEBUG +# xorpd %xmm0, %xmm0 +# xorpd %xmm1, %xmm1 +# jmp .Lfinal_check +#DEBUG + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # c6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # c6 + movapd .Lsinarray+0x20(%rip),%xmm8 # c3 + movapd .Lsinarray+0x20(%rip),%xmm9 # c3 + + movapd %xmm2,p_temp2(%rsp) # copy of x2 + movapd %xmm3,p_temp3(%rsp) # copy of x2 + + mulpd %xmm2,%xmm4 # c6*x2 + mulpd %xmm3,%xmm5 # c6*x2 + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6 + addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6 + addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3 + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(c5+x2c6) + mulpd %xmm3,%xmm5 # x2(c5+x2c6) + mulpd %xmm2,%xmm8 # x2(c2+x2C3) + mulpd %xmm3,%xmm9 # x2(c2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2 + 
mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2 + + addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6) + addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6) + addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3) + + mulpd %xmm6,%xmm2 # 0.5 * x2 *xx + mulpd %xmm7,%xmm3 # 0.5 * x2 *xx + + mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6)) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + movapd p_temp2(%rsp),%xmm10 # x2 + movapd p_temp3(%rsp),%xmm11 # x2 + + mulpd %xmm0,%xmm10 # x3 + mulpd %xmm1,%xmm11 # x3 + + mulpd %xmm10,%xmm4 # x3 * zs + mulpd %xmm11,%xmm5 # x3 * zs + + subpd %xmm2,%xmm4 # -0.5 * x2 *xx + subpd %xmm3,%xmm5 # -0.5 * x2 *xx + + addpd %xmm6,%xmm4 # +xx + addpd %xmm7,%xmm5 # +xx + + addpd %xmm0,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + + jmp .L__vrd4_sin_cleanup
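+
+# Per lane, the *_piby4 blocks above evaluate the following C sketch
+# (illustrative only, not part of the original source; x is the reduced
+# argument, xx its low-order tail, and s1..s6 / c1..c6 are the minimax
+# coefficients from the .Lsinarray / .Lcosarray tables):
+#
+#   double x2 = x*x, x3 = x2*x, x4 = x2*x2;
+#   double zs = s1 + x2*(s2 + x2*(s3 + x2*(s4 + x2*(s5 + x2*s6))));
+#   double zc = c1 + x2*(c2 + x2*(c3 + x2*(c4 + x2*(c5 + x2*c6))));
+#   double s  = x + (xx + (x3*zs - 0.5*x2*xx));          /* sin(x+xx) */
+#   double t  = 1.0 - 0.5*x2;                            /* head of cos */
+#   double c  = t + (x4*zc + (((1.0 - t) - 0.5*x2) - x*xx)); /* cos(x+xx) */
+#
+# The "(1 + (-t)) - r" comments above compute (1.0 - t) - 0.5*x2, i.e. the
+# rounding error lost when forming t, so the cos path keeps full precision.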
diff --git a/src/gas/vrdasincos.S b/src/gas/vrdasincos.S new file mode 100644 index 0000000..d31e98a --- /dev/null +++ b/src/gas/vrdasincos.S
@@ -0,0 +1,1710 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrdasincos.s +# +# An array implementation of the sincos libm function. +# +# Prototype: +# +# void vrda_sincos(int n, double *x, double *ys, double *yc); +# +#Computes Sine of x for an array of input values. +#Places the results into the supplied ys array. +#Computes Cosine of x for an array of input values. +#Places the results into the supplied yc array. +#Does not perform error checking. +#Denormal inputs may produce unexpected results +#Author: Harsha Jagasia +#Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 16 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__real_jt_mask: .quad 0x0000000000000000F # + .quad 0x00000000000000000 # +.L__real_naninf_upper_sign_mask: .quad 0x000000000ffffffff # + .quad 0x000000000ffffffff # +.L__real_naninf_lower_sign_mask: .quad 0x0ffffffff00000000 # + .quad 0x0ffffffff00000000 # + +.Lcosarray: + .quad 0x03fa5555555555555 # 0.0416667 c1 + 
.quad 0x03fa5555555555555
+	.quad 0x0bf56c16c16c16967	# -0.00138889 c2
+	.quad 0x0bf56c16c16c16967
+	.quad 0x03efa01a019f4ec90	# 2.48016e-005 c3
+	.quad 0x03efa01a019f4ec90
+	.quad 0x0be927e4fa17f65f6	# -2.75573e-007 c4
+	.quad 0x0be927e4fa17f65f6
+	.quad 0x03e21eeb69037ab78	# 2.08761e-009 c5
+	.quad 0x03e21eeb69037ab78
+	.quad 0x0bda907db46cc5e42	# -1.13826e-011 c6
+	.quad 0x0bda907db46cc5e42
+.Lsinarray:
+	.quad 0x0bfc5555555555555	# -0.166667 s1
+	.quad 0x0bfc5555555555555
+	.quad 0x03f81111111110bb3	# 0.00833333 s2
+	.quad 0x03f81111111110bb3
+	.quad 0x0bf2a01a019e83e5c	# -0.000198413 s3
+	.quad 0x0bf2a01a019e83e5c
+	.quad 0x03ec71de3796cde01	# 2.75573e-006 s4
+	.quad 0x03ec71de3796cde01
+	.quad 0x0be5ae600b42fdfa7	# -2.50511e-008 s5
+	.quad 0x0be5ae600b42fdfa7
+	.quad 0x03de5e0b2f9a43bb8	# 1.59181e-010 s6
+	.quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+	.quad 0x0bfc5555555555555	# -0.166667 s1
+	.quad 0x03fa5555555555555	# 0.0416667 c1
+	.quad 0x03f81111111110bb3	# 0.00833333 s2
+	.quad 0x0bf56c16c16c16967	# -0.00138889 c2
+	.quad 0x0bf2a01a019e83e5c	# -0.000198413 s3
+	.quad 0x03efa01a019f4ec90	# 2.48016e-005 c3
+	.quad 0x03ec71de3796cde01	# 2.75573e-006 s4
+	.quad 0x0be927e4fa17f65f6	# -2.75573e-007 c4
+	.quad 0x0be5ae600b42fdfa7	# -2.50511e-008 s5
+	.quad 0x03e21eeb69037ab78	# 2.08761e-009 c5
+	.quad 0x03de5e0b2f9a43bb8	# 1.59181e-010 s6
+	.quad 0x0bda907db46cc5e42	# -1.13826e-011 c6
+
+
+.Lcossinarray:
+	.quad 0x03fa5555555555555	# 0.0416667 c1
+	.quad 0x0bfc5555555555555	# -0.166667 s1
+	.quad 0x0bf56c16c16c16967	# -0.00138889 c2
+	.quad 0x03f81111111110bb3	# 0.00833333 s2
+	.quad 0x03efa01a019f4ec90	# 2.48016e-005 c3
+	.quad 0x0bf2a01a019e83e5c	# -0.000198413 s3
+	.quad 0x0be927e4fa17f65f6	# -2.75573e-007 c4
+	.quad 0x03ec71de3796cde01	# 2.75573e-006 s4
+	.quad 0x03e21eeb69037ab78	# 2.08761e-009 c5
+	.quad 0x0be5ae600b42fdfa7	# -2.50511e-008 s5
+	.quad 0x0bda907db46cc5e42	# -1.13826e-011 c6
+	.quad 0x03de5e0b2f9a43bb8	# 1.59181e-010 s6
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+	.weak vrda_sincos_
+	.set vrda_sincos_,__vrda_sincos__
+	.weak vrda_sincos__
+	.set vrda_sincos__,__vrda_sincos__
+
+.text
+	.align 16
+	.p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array sincos
+#** VRDA_SINCOS(N,X,YS,YC)
+# C equivalent*/
+#void vrda_sincos__( int * n, double *x, double *ys, double *yc)
+#{
+#	vrda_sincos(*n,x,ys,yc);
+#}
+.globl __vrda_sincos__
+	.type __vrda_sincos__,@function
+__vrda_sincos__:
+	mov (%rdi),%edi
+.align 16
+.p2align 4,,15
+
+# define local variable storage offsets
+.equ save_xmm6, 0x00 # temporary for get/put bits operation
+.equ save_xmm7, 0x10 # temporary for get/put bits operation
+.equ save_xmm8, 0x20 # temporary for get/put bits operation
+.equ save_xmm9, 0x30 # temporary for get/put bits operation
+.equ save_xmm10, 0x40 # temporary for get/put bits operation
+.equ save_xmm11, 0x50 # temporary for get/put bits operation
+.equ save_xmm12, 0x60 # temporary for get/put bits operation
+.equ save_xmm13, 0x70 # temporary for get/put bits operation
+.equ save_xmm14, 0x80 # temporary for get/put bits operation
+.equ save_xmm15, 0x90 # temporary for get/put bits operation
+
+.equ save_rdi, 0x0A0
+.equ save_rsi, 0x0B0
+.equ save_rbx, 0x0C0
+
+.equ r, 0x0D0 # pointer to r for remainder_piby2
+.equ rr, 0x0E0 # pointer to rr for remainder_piby2
+.equ rsq, 0x0F0
+.equ region, 0x0100 # pointer to region for remainder_piby2
+
+.equ r1, 0x0110 # pointer to r for remainder_piby2
+.equ rr1, 0x0120 # pointer to rr for remainder_piby2
+.equ rsq1, 0x0130
+.equ region1, 0x0140 # pointer to region for remainder_piby2
+
+.equ p_temp, 0x0150 # temporary for get/put bits operation
+.equ p_temp1, 0x0160 # temporary for
get/put bits operation + +.equ p_temp2, 0x0170 # temporary for get/put bits operation +.equ p_temp3, 0x0180 # temporary for get/put bits operation + +.equ p_temp4, 0x0190 # temporary for get/put bits operation +.equ p_temp5, 0x01A0 # temporary for get/put bits operation + +.equ p_temp6, 0x01B0 # temporary for get/put bits operation +.equ p_temp7, 0x01C0 # temporary for get/put bits operation + +.equ p_original, 0x01D0 # original x +.equ p_mask, 0x01E0 # original x +.equ p_signs, 0x01F0 # original x +.equ p_signc, 0x0200 # original x +.equ p_region, 0x0210 + +.equ p_original1, 0x0220 # original x +.equ p_mask1, 0x0230 # original x +.equ p_signs1, 0x0240 # original x +.equ p_signc1, 0x0250 # original x +.equ p_region1, 0x0260 + +.equ save_r12, 0x0270 # temporary for get/put bits operation +.equ save_r13, 0x0280 # temporary for get/put bits operation + +.equ save_r14, 0x0290 # temporary for get/put bits operation +.equ save_r15, 0x02A0 # temporary for get/put bits operation + +.equ save_xa, 0x02B0 # qword ; leave space for 4 args***** +.equ save_ysa, 0x02C0 # qword ; leave space for 4 args***** +.equ save_yca, 0x02D0 # qword ; leave space for 4 args***** + +.equ save_nv, 0x02E0 # qword +.equ p_iter, 0x02F0 # qword storage for number of loop iterations + + +.globl vrda_sincos + .type vrda_sincos,@function +vrda_sincos: + + sub $0x0308,%rsp + + mov %r12,save_r12(%rsp) # save r12 + mov %r13,save_r13(%rsp) # save r13 + mov %rbx,save_rbx(%rsp) # save rbx + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#START PROCESS INPUT +# save the arguments + mov %rsi,save_xa(%rsp) # save x_array pointer + mov %rdx,save_ysa(%rsp) # save ysin_array pointer + mov %rcx,save_yca(%rsp) # save ycos_array pointer +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax + mov %rax,%rdi +#endif + + mov %rdi,save_nv(%rsp) # save number of values + # see if too few values to call the main loop + shr $2,%rax # get number of iterations + jz .L__vrda_cleanup # jump if only single calls + # prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rdi # compute number of extra single calls + mov %rdi,save_nv(%rsp) # save number of left over values + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#START LOOP +.align 16 +.L__vrda_top: +# build the input _m128d + movapd .L__real_7fffffffffffffff(%rip),%xmm2 + mov .L__real_7fffffffffffffff(%rip),%rdx + + mov save_xa(%rsp),%rsi # get x_array pointer + movlpd (%rsi),%xmm0 + movhpd 8(%rsi),%xmm0 + mov (%rsi),%rax + mov 8(%rsi),%rcx + movdqa %xmm0,%xmm6 + movdqa %xmm0,p_original(%rsp) + + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + movlpd -16(%rsi), %xmm1 + movhpd -8(%rsi), %xmm1 + mov -16(%rsi), %r8 + mov -8(%rsi), %r9 + movdqa %xmm1,%xmm7 + movdqa %xmm1,p_original1(%rsp) + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + +andpd %xmm2,%xmm0 #Unsign +andpd %xmm2,%xmm1 #Unsign + +and %rdx,%rax +and %rdx,%rcx +and %rdx,%r8 +and %rdx,%r9 + +movdqa %xmm0,%xmm12 +movdqa %xmm1,%xmm13 + +pcmpgtd %xmm6,%xmm12 +pcmpgtd %xmm7,%xmm13 +movdqa %xmm12,%xmm6 +movdqa %xmm13,%xmm7 +psrldq $4,%xmm12 +psrldq $4,%xmm13 +psrldq $8,%xmm6 +psrldq $8,%xmm7 + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + + +por %xmm6,%xmm12 +por %xmm7,%xmm13 +movd %xmm12,%r12 #Move Sign to gpr ** 
+movd %xmm13,%r13 #Move Sign to gpr ** + +movapd %xmm0,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm0,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5, xmm6 =x +# xmm3 = x, xmm5 =0.5, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm0 + mulpd %xmm0,%xmm2 # * twobypi + mulpd %xmm0,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + + xorpd %xmm12,%xmm12 + + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. + + +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + + mov .L__reald_one_zero(%rip),%rdx # compare value for cossin path + mov %r8,%r10 # For Sign of Sin + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail + + pand .L__reald_one_one(%rip),%xmm4 #odd/even region for cos/sin + pand .L__reald_one_one(%rip),%xmm5 #odd/even region for cos/sin + + pcmpeqd %xmm12,%xmm4 + pcmpeqd %xmm12,%xmm5 + + punpckldq %xmm4,%xmm4 + punpckldq %xmm5,%xmm5 + + movapd %xmm4,p_region(%rsp) + movapd %xmm5,p_region1(%rsp) + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower 
sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_signs(%rsp) #write out lower sign bit + mov %r12,p_signs+8(%rsp) #write out upper sign bit + mov %r11,p_signs1(%rsp) #write out lower sign bit + mov %r13,p_signs1+8(%rsp) #write out upper sign bit + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm0,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm0 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + subpd %xmm0,%xmm6 #rr=rhead-r + subpd %xmm1,%xmm7 #rr=rhead-r + + movapd %xmm0,%xmm2 #move r for r2 + movapd %xmm1,%xmm3 #move r for r2 + + mulpd %xmm0,%xmm2 #r2 + mulpd %xmm1,%xmm3 #r2 + + subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail + subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail + + + add .L__reald_one_one(%rip),%r8 + add .L__reald_one_one(%rip),%r9 + + and .L__reald_two_two(%rip),%r8 + and .L__reald_two_two(%rip),%r9 + + shr $1,%r8 + shr $1,%r9 + + mov %r8,%r12 + mov %r9,%r13 + + and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r8 #shift lower sign bit left by 63 bits + shl $63,%r9 #shift lower sign bit left by 63 bits + + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r8,p_signc(%rsp) #write out lower sign bit + mov %r12,p_signc+8(%rsp) #write out upper sign bit + mov %r9,p_signc1(%rsp) #write out lower sign bit + mov %r13,p_signc1+8(%rsp) #write out upper sign bit + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsinsin_sinsin_piby4: + + movapd %xmm0,p_temp(%rsp) # copy of x + movapd %xmm1,p_temp1(%rsp) # copy of x + + movapd %xmm2,%xmm10 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x50(%rip),%xmm4 # s6 + movdqa .Lsinarray+0x50(%rip),%xmm5 # s6 + movapd .Lsinarray+0x20(%rip),%xmm8 # s3 + movapd .Lsinarray+0x20(%rip),%xmm9 # s3 + + movdqa .Lcosarray+0x50(%rip),%xmm12 # c6 + movdqa .Lcosarray+0x50(%rip),%xmm13 # c6 + movapd .Lcosarray+0x20(%rip),%xmm14 # c3 + movapd .Lcosarray+0x20(%rip),%xmm15 # c3 + + movapd %xmm2,p_temp2(%rsp) # copy of x2 + movapd %xmm3,p_temp3(%rsp) # copy of x2 + + mulpd %xmm2,%xmm4 # s6*x2 + mulpd %xmm3,%xmm5 # s6*x2 + mulpd %xmm2,%xmm8 # s3*x2 + mulpd %xmm3,%xmm9 # s3*x2 + + mulpd %xmm2,%xmm12 # s6*x2 + mulpd %xmm3,%xmm13 # s6*x2 + mulpd %xmm2,%xmm14 # s3*x2 + mulpd %xmm3,%xmm15 # s3*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsinarray+0x40(%rip),%xmm4 # s5+x2s6 + addpd .Lsinarray+0x40(%rip),%xmm5 # s5+x2s6 + addpd .Lsinarray+0x10(%rip),%xmm8 # s2+x2C3 + addpd .Lsinarray+0x10(%rip),%xmm9 # s2+x2C3 + + addpd .Lcosarray+0x40(%rip),%xmm12 # c5+x2c6 + addpd .Lcosarray+0x40(%rip),%xmm13 # c5+x2c6 + addpd 
.Lcosarray+0x10(%rip),%xmm14 # c2+x2C3 + addpd .Lcosarray+0x10(%rip),%xmm15 # c2+x2C3 + + mulpd %xmm2,%xmm10 # x6 + mulpd %xmm3,%xmm11 # x6 + + mulpd %xmm2,%xmm4 # x2(s5+x2s6) + mulpd %xmm3,%xmm5 # x2(s5+x2s6) + mulpd %xmm2,%xmm8 # x2(s2+x2C3) + mulpd %xmm3,%xmm9 # x2(s2+x2C3) + + mulpd %xmm2,%xmm12 # x2(s5+x2s6) + mulpd %xmm3,%xmm13 # x2(s5+x2s6) + mulpd %xmm2,%xmm14 # x2(s2+x2C3) + mulpd %xmm3,%xmm15 # x2(s2+x2C3) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2 + + addpd .Lsinarray+0x30(%rip),%xmm4 # s4 + x2(s5+x2s6) + addpd .Lsinarray+0x30(%rip),%xmm5 # s4 + x2(s5+x2s6) + addpd .Lsinarray(%rip),%xmm8 # s1 + x2(s2+x2C3) + addpd .Lsinarray(%rip),%xmm9 # s1 + x2(s2+x2C3) + + movapd %xmm2,p_temp4(%rsp) # copy of r + movapd %xmm3,p_temp5(%rsp) # copy of r + + movapd %xmm2,%xmm0 # r + movapd %xmm3,%xmm1 # r + + addpd .Lcosarray+0x30(%rip),%xmm12 # c4 + x2(c5+x2c6) + addpd .Lcosarray+0x30(%rip),%xmm13 # c4 + x2(c5+x2c6) + addpd .Lcosarray(%rip),%xmm14 # c1 + x2(c2+x2C3) + addpd .Lcosarray(%rip),%xmm15 # c1 + x2(c2+x2C3) + + mulpd %xmm6,%xmm2 # 0.5 * x2 *xx + mulpd %xmm7,%xmm3 # 0.5 * x2 *xx + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 + subpd .L__real_3ff0000000000000(%rip),%xmm1 # -t=r-1.0 + + mulpd %xmm10,%xmm4 # x6(s4 + x2(s5+x2s6)) + mulpd %xmm11,%xmm5 # x6(s4 + x2(s5+x2s6)) + + mulpd %xmm10,%xmm12 # x6(c4 + x2(c5+x2c6)) + mulpd %xmm11,%xmm13 # x6(c4 + x2(c5+x2c6)) + + addpd .L__real_3ff0000000000000(%rip),%xmm0 # 1+(-t) + addpd .L__real_3ff0000000000000(%rip),%xmm1 # 1+(-t) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + addpd %xmm14,%xmm12 # zc + addpd %xmm15,%xmm13 # zc + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = 0.5 * x2 *xx, xmm4 = zs, xmm12 = zc, xmm6 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = 0.5 * x2 *xx, xmm5 = zs, xmm13 = zc, xmm7 =rr + +# Free +# %xmm8,,%xmm10 xmm14 +# %xmm9,,%xmm11 xmm15 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd p_temp2(%rsp),%xmm10 # x2 for x3 + movapd p_temp3(%rsp),%xmm11 # x2 for x3 + + movapd %xmm10,%xmm8 # x2 for x4 + movapd %xmm11,%xmm9 # x2 for x4 + + movapd p_temp(%rsp),%xmm14 # x for x*xx + movapd p_temp1(%rsp),%xmm15 # x for x*xx + + subpd p_temp4(%rsp),%xmm0 # (1 + (-t)) - r + subpd p_temp5(%rsp),%xmm1 # (1 + (-t)) - r + + mulpd %xmm14,%xmm10 # x3 + mulpd %xmm15,%xmm11 # x3 + + mulpd %xmm8,%xmm8 # x4 + mulpd %xmm9,%xmm9 # x4 + + mulpd %xmm6,%xmm14 # x*xx + mulpd %xmm7,%xmm15 # x*xx + + mulpd %xmm10,%xmm4 # x3 * zs + mulpd %xmm11,%xmm5 # x3 * zs + + mulpd %xmm8,%xmm12 # x4 * zc + mulpd %xmm9,%xmm13 # x4 * zc + + subpd %xmm2,%xmm4 # x3*zs-0.5 * x2 *xx + subpd %xmm3,%xmm5 # x3*zs-0.5 * x2 *xx + + subpd %xmm14,%xmm0 # ((1 + (-t)) - r) -x*xx + subpd %xmm15,%xmm1 # ((1 + (-t)) - r) -x*xx + + + movapd p_temp4(%rsp),%xmm10 # r for t + movapd p_temp5(%rsp),%xmm11 # r for t + + addpd %xmm6,%xmm4 # sin+xx + addpd %xmm7,%xmm5 # sin+xx + + addpd %xmm0,%xmm12 # x4*zc + (((1 + (-t)) - r) - x*xx) + addpd %xmm1,%xmm13 # x4*zc + (((1 + (-t)) - r) - x*xx) + + subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + + movapd p_region(%rsp),%xmm2 + movapd p_region1(%rsp),%xmm3 + + movapd %xmm2,%xmm8 + movapd %xmm3,%xmm9 + + addpd p_temp(%rsp),%xmm4 # sin+xx+x + addpd p_temp1(%rsp),%xmm5 # sin+xx+x + + subpd %xmm10,%xmm12 # cos + (-t) + subpd %xmm11,%xmm13 # cos + (-t) + +# xmm4 = sin, xmm5 = sin +# xmm12 = cos, 
xmm13 = cos + + andnpd %xmm4,%xmm8 + andnpd %xmm5,%xmm9 + + andpd %xmm2,%xmm4 + andpd %xmm3,%xmm5 + + andnpd %xmm12,%xmm2 + andnpd %xmm13,%xmm3 + + andpd p_region(%rsp),%xmm12 + andpd p_region1(%rsp),%xmm13 + + orpd %xmm2,%xmm4 + orpd %xmm3,%xmm5 + + orpd %xmm8,%xmm12 + orpd %xmm9,%xmm13 + + jmp .L__vrd4_sin_cleanup + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm10, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm10 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm0 + subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail) + subsd %xmm0,%xmm6 # rr=rhead-r + subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm0,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr(%rsp),%rsi + lea r(%rsp),%rdi + movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrd4_sin_lower_naninf: + mov p_original(%rsp),%rax # upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr(%rsp) # rr = 0 + mov %r10d,region(%rsp) # region =0 + and 
.L__real_naninf_lower_sign_mask(%rip),%r12 # Sign
+.align 16
+0:
+	jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5
+
+	movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+	mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+	mov %r11,%r10
+	and %rax,%r10
+	cmp %r11,%r10
+	jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5
+
+	mov %rcx,p_temp(%rsp) #Save upper arg
+	mov %r8,p_temp2(%rsp)
+	mov %r9,p_temp4(%rsp)
+	movapd %xmm1,p_temp1(%rsp)
+	movapd %xmm3,p_temp3(%rsp)
+	movapd %xmm7,p_temp5(%rsp)
+
+	lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+	lea rr(%rsp),%rsi
+	lea r(%rsp),%rdi
+	call __amd_remainder_piby2@PLT
+
+	mov p_temp(%rsp),%rcx #Restore upper arg
+	mov p_temp2(%rsp),%r8
+	mov p_temp4(%rsp),%r9
+	movapd p_temp1(%rsp),%xmm1
+	movapd p_temp3(%rsp),%xmm3
+	movapd p_temp5(%rsp),%xmm7
+
+	jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+	mov p_original(%rsp),%rax
+	mov $0x00008000000000000,%r11
+	or %r11,%rax
+	mov %rax,r(%rsp) #r = x | 0x0008000000000000
+	xor %r10,%r10
+	mov %r10,rr(%rsp) #rr = 0
+	mov %r10d,region(%rsp) #region = 0
+	and .L__real_naninf_lower_sign_mask(%rip),%r12 # Sign
+
+.align 16
+0:
+	mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+	mov %r11,%r10
+	and %rcx,%r10
+	cmp %r11,%r10
+	jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5
+
+
+	mov %r8,p_temp(%rsp)
+	mov %r9,p_temp2(%rsp)
+	movapd %xmm1,p_temp1(%rsp)
+	movapd %xmm3,p_temp3(%rsp)
+	movapd %xmm7,p_temp5(%rsp)
+
+	lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+	lea rr+8(%rsp),%rsi
+	lea r+8(%rsp),%rdi
+	movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+	call __amd_remainder_piby2@PLT
+
+	mov p_temp(%rsp),%r8
+	mov p_temp2(%rsp),%r9
+	movapd p_temp1(%rsp),%xmm1
+	movapd p_temp3(%rsp),%xmm3
+	movapd p_temp5(%rsp),%xmm7
+
+	jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5:
+	mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+	mov $0x00008000000000000,%r11
+	or %r11,%rcx
+	mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+	xor %r10,%r10
+	mov %r10,rr+8(%rsp) #rr = 0
+	mov %r10d,region+4(%rsp) #region = 0
+	and .L__real_naninf_upper_sign_mask(%rip),%r12 # Sign
+.align 16
+0:
+	jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm10,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+	movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps %xmm0,%xmm0 #Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+#	movlhps %xmm2,%xmm2
+#	movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+	mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+	addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+	cvttsd2si %xmm2,%eax # eax = npi2 trunc to ints
+	movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+	cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an
extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+	mulsd %xmm2,%xmm8 # npi2 * piby2_1
+	subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 = piby2_2tail
+
+#t = rhead;
+	movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd %xmm2,%xmm10 # xmm10 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+	subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+	mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+	subsd %xmm6,%xmm5 # t-rhead
+	subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+	addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov %eax,region(%rsp) # store lower region
+	movsd %xmm6,%xmm0
+	subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+	subsd %xmm0,%xmm6 # rr=rhead-r
+	subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+	movlpd %xmm0,r(%rsp) # store lower r
+	movlpd %xmm6,rr(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+	mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+	mov %r11,%r10
+	and %rcx,%r10
+	cmp %r11,%r10
+	jz .L__vrd4_sin_upper_naninf
+
+
+	mov %r8,p_temp(%rsp)
+	mov %r9,p_temp2(%rsp)
+	movapd %xmm1,p_temp1(%rsp)
+	movapd %xmm3,p_temp3(%rsp)
+	movapd %xmm7,p_temp5(%rsp)
+
+	lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+	lea rr+8(%rsp),%rsi
+	lea r+8(%rsp),%rdi
+	movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+	call __amd_remainder_piby2@PLT
+
+	mov p_temp(%rsp),%r8
+	mov p_temp2(%rsp),%r9
+	movapd p_temp1(%rsp),%xmm1
+	movapd p_temp3(%rsp),%xmm3
+	movapd p_temp5(%rsp),%xmm7
+	jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+	mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+#	mov r+8(%rsp),%rcx ; upper arg is nan/inf
+	mov $0x00008000000000000,%r11
+	or %r11,%rcx
+	mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+	xor %r10,%r10
+	mov %r10,rr+8(%rsp) # rr = 0
+	mov %r10d,region+4(%rsp) # region = 0
+	and .L__real_naninf_upper_sign_mask(%rip),%r12 # Sign
+
+.align 16
+0:
+	jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+	mov $0x411E848000000000,%r10 #5e5
+
+
+	cmp %r10,%r8
+	jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+	cmp %r10,%r9
+	jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+	movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+	mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+	addpd %xmm4,%xmm3 # +0.5, npi2
+	movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+	cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+	movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+	cvtdq2pd %xmm5,%xmm3 # and back to double.
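+
+# The packed sequence that follows is the same two-word Cody-Waite step the
+# scalar path above just performed; in C (illustrative only, with the
+# piby2_1 / piby2_2 / piby2_2tail constants from the .data section):
+#
+#   double rhead = x - npi2 * piby2_1;
+#   double rtail = npi2 * piby2_2;
+#   double t = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   double r  = rhead - rtail;        /* reduced argument        */
+#   double rr = (rhead - r) - rtail;  /* low-order tail of r     */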
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm5,region1(%rsp) # Region + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm1,%xmm7 # rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + + subpd %xmm1,%xmm7 # rr=rhead-r + subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail + movapd %xmm7,rr1(%rsp) + + jmp .L__vrd4_sin_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm10, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + +#DEBUG +# movapd %xmm2, %xmm4 +# movapd %xmm1, %xmm5 +# movapd %xmm2, %xmm12 +# movapd %xmm1, %xmm13 +# jmp .L__vrd4_sin_cleanup +#DEBUG + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
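+
+# The movq below stores npi2 into region; only its two low bits (the
+# quadrant) are consumed later. A reading of the mask/sign logic on the
+# main path, in illustrative C (not part of the original source):
+#
+#   int quad = npi2 & 3;                  /* quadrant of |x| mod 2*pi     */
+#   int use_cos_poly = quad & 1;          /* odd quadrant: sin from zc    */
+#   int sign_sin = ((quad >> 1) & 1) ^ (x < 0.0); /* sin is odd in x      */
+#   int sign_cos = ((quad + 1) >> 1) & 1;         /* cos < 0 in quads 1,2 */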
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + movq %xmm4,region(%rsp) # Region + +#DEBUG +# movapd region(%rsp), %xmm4 +# movapd %xmm1, %xmm5 +# movapd region(%rsp), %xmm12 +# movapd %xmm1, %xmm13 +# jmp .L__vrd4_sin_cleanup +#DEBUG + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm0 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm0 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm0 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm0,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + movapd %xmm0,%xmm6 # rhead + subpd %xmm8,%xmm0 # r = rhead - rtail + movapd %xmm0,r(%rsp) + + subpd %xmm0,%xmm6 # rr=rhead-r + subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail + movapd %xmm6,rr(%rsp) + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + +#DEBUG +# movapd region(%rsp), %xmm4 +# movapd %xmm1, %xmm5 +# movapd region(%rsp), %xmm12 +# movapd %xmm1, %xmm13 +# jmp .L__vrd4_sin_cleanup +#DEBUG + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + movsd %xmm7,%xmm1 + subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail) + subsd %xmm1,%xmm7 # rr=rhead-r + subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail) + movlpd %xmm1,r1+8(%rsp) # store upper r + movlpd %xmm7,rr1+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by 
the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_sin_lower_naninf_higher: + mov p_original1(%rsp),%r8 # upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) # rr = 0 + mov %r10d,region1(%rsp) # region =0 + and .L__real_naninf_lower_sign_mask(%rip),%r13 # Sign + +.align 16 +0: + + +#DEBUG +# movapd r(%rsp), %xmm4 +# movapd r1(%rsp), %xmm5 +# movapd r(%rsp), %xmm12 +# movapd r1(%rsp), %xmm13 +# jmp .L__vrd4_sin_cleanup +#DEBUG + + + jmp .L__vrd4_sin_reconstruct + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea rr1(%rsp),%rsi + lea r1(%rsp),%rdi + movsd %xmm1,%xmm0 + call __amd_remainder_piby2@PLT + mov p_temp1(%rsp),%r9 #Restore upper arg + + jmp 0f + +.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov p_original1(%rsp),%r8 + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1(%rsp) #rr = 0 + mov %r10d,region1(%rsp) #region = 0 + and .L__real_naninf_lower_sign_mask(%rip),%r13 # Sign + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea rr1+8(%rsp),%rsi + lea r1+8(%rsp),%rdi + movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call + call __amd_remainder_piby2@PLT + jmp 0f + +.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher: + mov p_original1+8(%rsp),%r9 #upper arg is nan/inf +# movd %xmm6,%r9 ;upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) #rr = 0 + mov %r10d,region1+4(%rsp) #region = 0 + and .L__real_naninf_upper_sign_mask(%rip),%r13 # Sign + +.align 16 +0: + + jmp .L__vrd4_sin_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. 
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+	movq	%xmm4,region(%rsp)			# Region
+
+# rhead  = x - npi2 * piby2_1;
+	mulpd	%xmm2,%xmm0				# npi2 * piby2_1;
+# rtail  = npi2 * piby2_2;
+	mulpd	%xmm2,%xmm8				# rtail
+
+# rhead  = x - npi2 * piby2_1;
+	subpd	%xmm0,%xmm6				# rhead  = x - npi2 * piby2_1;
+
+# t  = rhead;
+	movapd	%xmm6,%xmm0				# t
+
+# rhead  = t - rtail;
+	subpd	%xmm8,%xmm0				# rhead
+
+# rtail  = npi2 * piby2_2tail - ((t - rhead) - rtail);
+	mulpd	.L__real_3ba3198a2e037073(%rip),%xmm2	# npi2 * piby2_2tail
+
+	subpd	%xmm0,%xmm6				# t-rhead
+	subpd	%xmm6,%xmm8				# - ((t - rhead) - rtail)
+	addpd	%xmm2,%xmm8				# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+	movapd	%xmm0,%xmm6				# rhead
+	subpd	%xmm8,%xmm0				# r = rhead - rtail
+	movapd	%xmm0,r(%rsp)
+
+	subpd	%xmm0,%xmm6				# rr=rhead-r
+	subpd	%xmm8,%xmm6				# rr=(rhead-r) -rtail
+	movapd	%xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+	movhpd	%xmm1,r1+8(%rsp)	#Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm1,%xmm1		#Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+#	movlhps	%xmm3,%xmm3
+#	movlhps	%xmm7,%xmm7
+	movapd	.L__real_3fe0000000000000(%rip),%xmm4	#0.5 for later use
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+	mulsd	.L__real_3fe45f306dc9c883(%rip),%xmm3	# x*twobypi
+	addsd	%xmm4,%xmm3				# xmm3 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm2	# xmm2 = piby2_1
+	cvttsd2si	%xmm3,%r8d			# r8d = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+	cvtsi2sd	%r8d,%xmm3			# xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead  = x - npi2 * piby2_1;
+	mulsd	%xmm3,%xmm2				# npi2 * piby2_1
+	subsd	%xmm2,%xmm7				# xmm7 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm6	# xmm6 =piby2_2tail
+
+#t  = rhead;
+	movsd	%xmm7,%xmm5				# xmm5 = t = rhead
+
+#rtail  = npi2 * piby2_2;
+	mulsd	%xmm3,%xmm0				# xmm0 =rtail=(npi2*piby2_2)
+
+#rhead  = t - rtail
+	subsd	%xmm0,%xmm7				# xmm7 =rhead=(t-rtail)
+
+#rtail  = npi2 * piby2_2tail - ((t - rhead) - rtail);
+	mulsd	%xmm3,%xmm6				# npi2 * piby2_2tail
+	subsd	%xmm7,%xmm5				# t-rhead
+	subsd	%xmm5,%xmm0				# (rtail-(t-rhead))
+	addsd	%xmm6,%xmm0				# rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r =  rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%r8d,region1(%rsp)			# store lower region
+	movsd	%xmm7,%xmm1
+	subsd	%xmm0,%xmm1				# xmm1 = r=(rhead-rtail)
+	subsd	%xmm1,%xmm7				# rr=rhead-r
+	subsd	%xmm0,%xmm7				# xmm7 = rr=((rhead-r) -rtail)
+
+	movlpd	%xmm1,r1(%rsp)				# store lower r
+	movlpd	%xmm7,rr1(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+	mov	$0x07ff0000000000000,%r11		# is upper arg nan/inf
+	mov	%r11,%r10
+	and	%r9,%r10
+	cmp	%r11,%r10
+	jz	.L__vrd4_sin_upper_naninf_higher
+
+	lea	region1+4(%rsp),%rdx			# upper arg is **NOT** nan/inf
+	lea	rr1+8(%rsp),%rsi
+	lea	r1+8(%rsp),%rdi
+	movlpd	r1+8(%rsp),%xmm0	#Restore upper fp arg for remainder_piby2 call
+	call	__amd_remainder_piby2@PLT
+	jmp	0f
+
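+
+# A note on the helper used above: every __amd_remainder_piby2 call site in
+# this file loads rdi = &r, rsi = &rr, rdx = &region and xmm0 = x, which under
+# the SysV AMD64 ABI matches a C prototype along the lines of the following
+# sketch (inferred from the call sites here, not taken from a header):
+#
+#	/* hypothetical prototype, inferred from this file's call sites */
+#	extern void __amd_remainder_piby2(double x, double *r, double *rr,
+#	                                  int *region);
+#
+# i.e. the out-of-line path reduces a huge argument and writes back the same
+# head/tail remainder pair and quadrant that the inline reduction produces.
+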
+.L__vrd4_sin_upper_naninf_higher: + mov p_original1+8(%rsp),%r9 # upper arg is nan/inf +# mov r1+8(%rsp),%r9 ; upper arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + xor %r10,%r10 + mov %r10,rr1+8(%rsp) # rr = 0 + mov %r10d,region1+4(%rsp) # region =0 + and .L__real_naninf_upper_sign_mask(%rip),%r13 # Sign + +.align 16 +0: + jmp .L__vrd4_sin_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_sin_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#DEBUG +# movapd region(%rsp), %xmm4 +# movapd region1(%rsp), %xmm5 +# movapd region(%rsp), %xmm12 +# movapd region1(%rsp), %xmm13 +# jmp .L__vrd4_sin_cleanup +#DEBUG + + + movapd r(%rsp),%xmm0 + movapd r1(%rsp),%xmm1 + + movapd rr(%rsp),%xmm6 + movapd rr1(%rsp),%xmm7 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + + movlpd region(%rsp),%xmm4 + movlpd region1(%rsp),%xmm5 + + pand .L__reald_one_one(%rip),%xmm4 #odd/even region for cos/sin + pand .L__reald_one_one(%rip),%xmm5 #odd/even region for cos/sin + + xorpd %xmm12,%xmm12 + pcmpeqd %xmm12,%xmm4 + pcmpeqd %xmm12,%xmm5 + + punpckldq %xmm4,%xmm4 + punpckldq %xmm5,%xmm5 + + movapd %xmm4,p_region(%rsp) + movapd %xmm5,p_region1(%rsp) + + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + mov %r8,%r10 + mov %r9,%r11 + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_signs(%rsp) #write out lower sign bit + mov %r12,p_signs+8(%rsp) #write out upper sign bit + mov %r11,p_signs1(%rsp) #write out lower sign bit + mov %r13,p_signs1+8(%rsp) #write out upper sign bit + + movapd %xmm0,%xmm2 # r + movapd %xmm1,%xmm3 # r + + mulpd %xmm0,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + add .L__reald_one_one(%rip),%r8 + add .L__reald_one_one(%rip),%r9 + + and .L__reald_two_two(%rip),%r8 + and .L__reald_two_two(%rip),%r9 + + shr $1,%r8 + shr $1,%r9 + + mov %r8,%rax + mov %r9,%rcx + + and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r8 #shift lower sign bit left by 63 bits + shl $63,%r9 #shift lower sign bit left by 63 bits + + shl $31,%rax #shift upper sign bit left by 31 bits + shl $31,%rcx #shift upper sign bit left by 31 bits + + mov %r8,p_signc(%rsp) #write out lower sign bit + mov %rax,p_signc+8(%rsp) #write out upper sign bit + mov 
%r9,p_signc1(%rsp) #write out lower sign bit + mov %rcx,p_signc1+8(%rsp) #write out upper sign bit + + jmp .Lsinsin_sinsin_piby4 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrd4_sin_cleanup: + + xorpd p_signs(%rsp),%xmm4 # (+) Sign + xorpd p_signs1(%rsp),%xmm5 # (+) Sign + + xorpd p_signc(%rsp),%xmm12 # (+) Sign + xorpd p_signc1(%rsp),%xmm13 # (+) Sign + +.L__vrda_bottom1: +# store the result _m128d + mov save_ysa(%rsp),%rdi # get ysin_array pointer + mov save_yca(%rsp),%rbx # get ycos_array pointer + + movlpd %xmm4,(%rdi) + movhpd %xmm4,8(%rdi) + + movlpd %xmm12,(%rbx) + movhpd %xmm12,8(%rbx) + +.L__vrda_bottom2: + + prefetch 64(%rdi) + prefetch 64(%rbx) + + add $32,%rdi + add $32,%rbx + + mov %rdi,save_ysa(%rsp) # save ysin_array pointer + mov %rbx,save_yca(%rsp) # save ycos_array pointer + +# store the result _m128d + movlpd %xmm5, -16(%rdi) + movhpd %xmm5, -8(%rdi) + + movlpd %xmm13, -16(%rbx) + movhpd %xmm13, -8(%rbx) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vrda_top + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vrda_cleanup + +.L__final_check: + + mov save_r12(%rsp),%r12 # restore r12 + mov save_r13(%rsp),%r13 # restore r13 + mov save_rbx(%rsp),%rbx # restore rbx + + add $0x0308,%rsp + ret + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# we jump here when we have an odd number of cos calls to make at the end +# we assume that rdx is pointing at the next x array element, r8 at the next y array element. +# The number of values left is in save_nv + +.align 16 +.L__vrda_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__final_check # exit if not + + mov save_xa(%rsp),%rsi + +# fill in a m128d with zeroes and the extra values and then make a recursive call. + xorpd %xmm0,%xmm0 + movlpd %xmm0,p_temp+8(%rsp) + movapd %xmm0,p_temp+16(%rsp) + + mov (%rsi),%rcx # we know there's at least one + mov %rcx,p_temp(%rsp) + cmp $2,%rax + jl .L__vrdacg + + mov 8(%rsi),%rcx # do the second value + mov %rcx,p_temp+8(%rsp) + cmp $3,%rax + jl .L__vrdacg + + mov 16(%rsi),%rcx # do the third value + mov %rcx,p_temp+16(%rsp) + +.L__vrdacg: + mov $4,%rdi # parameter for N + lea p_temp(%rsp),%rsi # &x parameter + lea p_temp2(%rsp),%rdx # &ys parameter + lea p_temp4(%rsp),%rcx # &yc parameter + + call vrda_sincos@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ysa(%rsp),%rdi + mov save_yca(%rsp),%rbx + mov save_nv(%rsp),%rax # get number of values + + mov p_temp2(%rsp),%rcx + mov %rcx,(%rdi) # we know there's at least one + mov p_temp4(%rsp),%rdx + mov %rdx,(%rbx) # we know there's at least one + cmp $2,%rax + jl .L__vrdacgf + + mov p_temp2+8(%rsp),%rcx + mov %rcx,8(%rdi) # do the second value + mov p_temp4+8(%rsp),%rdx + mov %rdx,8(%rbx) # do the second value + cmp $3,%rax + jl .L__vrdacgf + + mov p_temp2+16(%rsp),%rcx + mov %rcx,16(%rdi) # do the third value + mov p_temp4+16(%rsp),%rdx + mov %rdx,16(%rbx) # do the third value + +.L__vrdacgf: + jmp .L__final_check
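+
+# For reference, the extra-precision reduction used throughout this file
+# corresponds to the C sketch below (names follow the comments above; the
+# decimal values shown are the usual three-part Cody-Waite split of pi/2 that
+# the .L__real_* constants encode; a sketch, not part of the build):
+#
+#	/* reduce x (x >= 0; the sign is handled separately) to r + rr in
+#	   [-pi/4, pi/4]; region mod 4 selects the quadrant */
+#	static void reduce_piby2(double x, double *r, double *rr, int *region)
+#	{
+#	    const double twobypi     = 6.36619772367581382433e-01;
+#	    const double piby2_1     = 1.57079632673412561417e+00;
+#	    const double piby2_2     = 6.07710050630396597660e-11;
+#	    const double piby2_2tail = 2.02226624879595063154e-21;
+#	    int    npi2  = (int)(x * twobypi + 0.5);
+#	    double rhead = x - npi2 * piby2_1; /* exact: piby2_1 ends in zeros */
+#	    double rtail = npi2 * piby2_2;
+#	    double t     = rhead;
+#	    rhead   = t - rtail;
+#	    rtail   = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#	    *r      = rhead - rtail;           /* high part of the remainder */
+#	    *rr     = (rhead - *r) - rtail;    /* low part of the remainder  */
+#	    *region = npi2;
+#	}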
diff --git a/src/gas/vrs4cosf.S b/src/gas/vrs4cosf.S new file mode 100644 index 0000000..ab59058 --- /dev/null +++ b/src/gas/vrs4cosf.S
@@ -0,0 +1,2122 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4_cosf.s
+#
+# A vector implementation of the cosf libm function.
+#
+# Prototype:
+#
+# __m128 __vrs4_cosf(__m128 x);
+#
+# Computes the cosine of each of four packed single-precision input values.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single-precision cosine values at a time.
+# The four values are passed as packed singles in xmm0, and the four
+# results are returned as packed singles in xmm0.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 4 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array
+# implementation of the routine requires putting the inputs into memory and
+# retrieving the results from memory; this routine eliminates that overhead
+# when the data does not already reside in memory.
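+
+# In scalar C terms, each of the four lanes is processed roughly as in the
+# sketch below (a reference model, not the implementation: the vector code
+# processes two lanes per xmm register and merges sin/cos kernels through the
+# jump table further down; reduce_piby2 is the sketch at the end of the
+# preceding file, and the coefficients are the decimal values noted next to
+# .Lcosarray/.Lsinarray below):
+#
+#	static double cos_poly(double r)   /* cos on [-pi/4, pi/4] */
+#	{
+#	    double r2 = r * r, r4 = r2 * r2;
+#	    double zc = (0.0416667 + r2 * -0.00138889)
+#	              + r4 * (2.48016e-05 + r2 * -2.75573e-07);
+#	    return 1.0 - 0.5 * r2 + r4 * zc; /* done as r4*zc - (0.5*r2 - 1.0) */
+#	}
+#
+#	static double sin_poly(double r)   /* sin on [-pi/4, pi/4] */
+#	{
+#	    double r2 = r * r, r4 = r2 * r2;
+#	    double zs = (-0.166667 + r2 * 0.00833333)
+#	              + r4 * (-0.000198413 + r2 * 2.75573e-06);
+#	    return r + r * r2 * zs;
+#	}
+#
+#	static float cosf_lane(float xf)   /* assumes the |x| < 5e5 fast path */
+#	{
+#	    double x = fabs((double)xf);   /* needs <math.h> */
+#	    double r, rr;                  /* rr unused at float precision */
+#	    int    region;
+#	    reduce_piby2(x, &r, &rr, &region);
+#	    switch (region & 3) {          /* cos(n*pi/2 + r) quadrant map */
+#	    case 0:  return (float) cos_poly(r);
+#	    case 1:  return (float)-sin_poly(r);
+#	    case 2:  return (float)-cos_poly(r);
+#	    default: return (float) sin_poly(r);
+#	    }
+#	}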
+# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 64 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03FA5555555502F31 + .quad 0x0BF56C16BF55699D7 # -0.00138889 c2 + .quad 0x0BF56C16BF55699D7 + .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3 + .quad 0x03EFA015C50A93B49 + .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4 + .quad 0x0BE92524743CC46B8 + +.Lsinarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BFC555555545E87D + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x03F811110DF01232D + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x0BF2A013A88A37196 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x03EC6DBE4AD1572D5 + +.Lsincosarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x0BF56C16BF55699D7 + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x03EFA015C50A93B49 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x0BE92524743CC46B8 + +.Lcossinarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BF56C16BF55699D7 # c2 + .quad 0x03F811110DF01232D + .quad 0x03EFA015C50A93B49 # c3 + .quad 0x0BF2A013A88A37196 + .quad 0x0BE92524743CC46B8 # c4 + .quad 0x03EC6DBE4AD1572D5 + + +.align 64 +.Levencos_oddsin_tbl: + + .quad .Lcoscos_coscos_piby4 # 0 * ; Done + .quad .Lcoscos_cossin_piby4 # 1 + ; Done + .quad .Lcoscos_sincos_piby4 # 2 ; Done + .quad .Lcoscos_sinsin_piby4 # 3 + ; Done + + .quad .Lcossin_coscos_piby4 # 4 ; Done + .quad .Lcossin_cossin_piby4 # 5 * ; Done + .quad .Lcossin_sincos_piby4 # 6 ; Done + .quad 
.Lcossin_sinsin_piby4 # 7 ; Done + + .quad .Lsincos_coscos_piby4 # 8 ; Done + .quad .Lsincos_cossin_piby4 # 9 ; TBD + .quad .Lsincos_sincos_piby4 # 10 * ; Done + .quad .Lsincos_sinsin_piby4 # 11 ; Done + + .quad .Lsinsin_coscos_piby4 # 12 ; Done + .quad .Lsinsin_cossin_piby4 # 13 + ; Done + .quad .Lsinsin_sincos_piby4 # 14 ; Done + .quad .Lsinsin_sinsin_piby4 # 15 * ; Done + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .text + .align 16 + .p2align 4,,15 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# define local variable storage offsets +.equ p_temp,0 # temporary for get/put bits operation +.equ p_temp1,0x10 # temporary for get/put bits operation + +.equ save_xmm6,0x20 # temporary for get/put bits operation +.equ save_xmm7,0x30 # temporary for get/put bits operation +.equ save_xmm8,0x40 # temporary for get/put bits operation +.equ save_xmm9,0x50 # temporary for get/put bits operation +.equ save_xmm0,0x60 # temporary for get/put bits operation +.equ save_xmm11,0x70 # temporary for get/put bits operation +.equ save_xmm12,0x80 # temporary for get/put bits operation +.equ save_xmm13,0x90 # temporary for get/put bits operation +.equ save_xmm14,0x0A0 # temporary for get/put bits operation +.equ save_xmm15,0x0B0 # temporary for get/put bits operation + +.equ r,0x0C0 # pointer to r for remainder_piby2 +.equ rr,0x0D0 # pointer to r for remainder_piby2 +.equ region,0x0E0 # pointer to r for remainder_piby2 + +.equ r1,0x0F0 # pointer to r for remainder_piby2 +.equ rr1,0x0100 # pointer to r for remainder_piby2 +.equ region1,0x0110 # pointer to r for remainder_piby2 + +.equ p_temp2,0x0120 # temporary for get/put bits operation +.equ p_temp3,0x0130 # temporary for get/put bits operation + +.equ p_temp4,0x0140 # temporary for get/put bits operation +.equ p_temp5,0x0150 # temporary for get/put bits operation + +.equ p_original,0x0160 # original x +.equ p_mask,0x0170 # original x +.equ p_sign,0x0180 # original x + +.equ p_original1,0x0190 # original x +.equ p_mask1,0x01A0 # original x +.equ p_sign1,0x01B0 # original x + + +.equ save_r12,0x01C0 # temporary for get/put bits operation +.equ save_r13,0x01D0 # temporary for get/put bits operation + + +.globl __vrs4_cosf + .type __vrs4_cosf,@function +__vrs4_cosf: + sub $0x01E8,%rsp + +#DEBUG +# mov %r12,save_r12(%rsp) # save r12 +# mov %r13,save_r13(%rsp) # save r13 + +# mov save_r12(%rsp),%r12 # restore r12 +# mov save_r13(%rsp),%r13 # restore r13 + +# add $0x01E8,%rsp +# ret +#DEBUG + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + mov %r12,save_r12(%rsp) # save r12 + mov %r13,save_r13(%rsp) # save r13 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + + movhlps %xmm0,%xmm8 + cvtps2pd %xmm0,%xmm10 # convert input to double. + cvtps2pd %xmm8,%xmm1 # convert input to double. 
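+
+# In intrinsics terms, the three instructions above split the four packed
+# singles into two double-precision pairs (a sketch, not part of the build):
+#
+#	/* #include <emmintrin.h> */
+#	__m128d lo = _mm_cvtps_pd(x);                    /* elements 0,1 */
+#	__m128d hi = _mm_cvtps_pd(_mm_movehl_ps(x, x));  /* elements 2,3 */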
+ +movdqa %xmm10,%xmm6 +movdqa %xmm1,%xmm7 +movapd .L__real_7fffffffffffffff(%rip),%xmm2 + +andpd %xmm2,%xmm10 #Unsign +andpd %xmm2,%xmm1 #Unsign + +movd %xmm10,%rax #rax is lower arg +movhpd %xmm10, p_temp+8(%rsp) # +mov p_temp+8(%rsp),%rcx #rcx = upper arg + +movd %xmm1,%r8 #r8 is lower arg +movhpd %xmm1, p_temp1+8(%rsp) # +mov p_temp1+8(%rsp),%r9 #r9 = upper arg + +movdqa %xmm10,%xmm12 +movdqa %xmm1,%xmm13 + +pcmpgtd %xmm6,%xmm12 +pcmpgtd %xmm7,%xmm13 +movdqa %xmm12,%xmm6 +movdqa %xmm13,%xmm7 +psrldq $4,%xmm12 +psrldq $4,%xmm13 +psrldq $8,%xmm6 +psrldq $8,%xmm7 + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + + +por %xmm6,%xmm12 +por %xmm7,%xmm13 + +movapd %xmm10,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm10,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm10 + mulpd %xmm10,%xmm2 # * twobypi + mulpd %xmm10,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. 
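+
+# The mul/add/cvt sequence above implements npi2 = (int)(x*twobypi + 0.5) per
+# the comment above: x is non-negative here (the sign was masked off at
+# entry), so adding 0.5 before truncation rounds to the nearest integer, and
+# converting the int back to double is exact. As a C sketch:
+#
+#	int    npi2  = (int)(x * twobypi + 0.5); /* nearest multiple of pi/2 */
+#	double npi2d = (double)npi2;             /* exact round trip */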
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + mov %r8,%r10 + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + + mov %r10,%rax + mov %r11,%rcx + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + xor %rax,%r10 + xor %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm10,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm10 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 # move r for r2 + movapd %xmm1,%xmm3 # move r for r2 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + leaq .Levencos_oddsin_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 
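+
+# The indirect jump above packs one even/odd region bit per lane into a 4-bit
+# index into .Levencos_oddsin_tbl (even region -> cos kernel, odd region ->
+# sin kernel, per the table name), while the sign written to p_sign satisfies
+# cos(n*pi/2 + r): negative exactly in quadrants 1 and 2. A C sketch:
+#
+#	int idx = (region0 & 1)         /* lane 0 -> bit 0 */
+#	        | ((region1 & 1) << 1)  /* lane 1 -> bit 1 */
+#	        | ((region2 & 1) << 2)  /* lane 2 -> bit 2 */
+#	        | ((region3 & 1) << 3); /* lane 3 -> bit 3 */
+#	kernel_tbl[idx]();              /* one of the 16 .L*_piby4 blocks */
+#
+#	int sign0 = (region0 ^ (region0 >> 1)) & 1; /* 1 for n mod 4 = 1, 2 */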
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm0, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm10 + subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail) + subsd %xmm10,%xmm6 # rr=rhead-r + subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm10,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_lower_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_cosf_lower_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + jmp 
.Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# added ins- changed input from xmm10 to xmm0 + movd %xmm10,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrs4_cosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrs4_cosf_upper_naninf_of_both_gt_5e5: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Restore xmm4 and %xmm3,,%xmm1 xmm7 +# Can use %xmm0,,%xmm8 xmm12 +# %xmm9,,%xmm5 xmm11, xmm13 + + movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * 
piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store upper region + + subsd %xmm0,%xmm6 # xmm10 = r=(rhead-rtail) + + movlpd %xmm6,r(%rsp) # store upper r + + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_upper_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_cosf_upper_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcheck_next2_args: + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r8 + jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfirst_second_done_fourth_arg_gt_5e5 + + + +# Work on next two args, both < 5e5 +# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5 + + movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. 
+ +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm5,region1(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + + jmp .L__vrs4_cosf_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm0, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the 
multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + + subsd %xmm10,%xmm7 # xmm1 = r=(rhead-rtail) + + movlpd %xmm7,r1+8(%rsp) # store upper r + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_cosf_lower_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_cosf_reconstruct + + + + + + + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + + movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm1,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp1(%rsp),%r9 #Restore upper arg + + jmp 0f + +.L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + + jmp .L__vrs4_cosf_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd 
%xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# Work on next two args, third arg < 5e5, fourth arg >= 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_fourth_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r8d,region1(%rsp) # store lower region + + subsd %xmm10,%xmm7 # xmm10 = r=(rhead-rtail) + + movlpd %xmm7,r1(%rsp) # store upper r + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_upper_naninf_higher + + lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_cosf_upper_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_cosf_reconstruct + + 
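+
+# All of the nan/inf blocks above share one pattern: if the exponent field of
+# the lane is all ones (mask 0x7ff0000000000000), the input is NaN or Inf, so
+# the "remainder" is simply the input with the quiet-NaN bit forced on
+# (turning an Inf into a NaN as well) and the region forced to 0. A C sketch:
+#
+#	/* needs <stdint.h> and <string.h> */
+#	uint64_t bits;
+#	memcpy(&bits, &x, sizeof bits);
+#	if ((bits & 0x7ff0000000000000ULL) == 0x7ff0000000000000ULL) {
+#	    bits |= 0x0008000000000000ULL;   /* quiet-NaN bit */
+#	    memcpy(&r, &bits, sizeof r);
+#	    region = 0;
+#	}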
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_cosf_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd r(%rsp),%xmm10 + movapd r1(%rsp),%xmm1 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + + mov %r8,%r10 + mov %r9,%r11 + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + mov %r10,%rax + mov %r11,%rcx + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + xor %rax,%r10 + xor %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + leaq .Levencos_oddsin_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_cosf_cleanup: + + movapd p_sign(%rsp),%xmm10 + movapd p_sign1(%rsp),%xmm1 + + xorpd %xmm4,%xmm10 # (+) Sign + xorpd %xmm5,%xmm1 # (+) Sign + + cvtpd2ps %xmm10,%xmm0 + cvtpd2ps %xmm1,%xmm11 + movlhps %xmm11,%xmm0 + + mov save_r12(%rsp),%r12 # restore r12 + mov save_r13(%rsp),%r13 # restore r13 + + add $0x01E8,%rsp + ret + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_coscos_piby4: + movapd %xmm2,%xmm0 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm2,%xmm4 # x4(c3+x2c4) + mulpd %xmm3,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm0,%xmm4 # + t + subpd %xmm11,%xmm5 # + t + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lcossin_cossin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # s2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3 + addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3 + addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1 + addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm0,%xmm0 # move high x4 for cos term + movhlps %xmm11,%xmm11 # move high x4 for cos term + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd %xmm10,%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos + movhlps %xmm3,%xmm13 # move high r for cos + + movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos + movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm7,%xmm5 # sin *x3 + + mulsd %xmm0,%xmm8 # cos *x4 + mulsd %xmm11,%xmm9 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0 + + addsd %xmm10,%xmm4 # sin + x + addsd %xmm1,%xmm5 # sin + x + subsd 
%xmm12,%xmm8 # cos+t + subsd %xmm13,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + + jmp .L__vrs4_cosf_cleanup + +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lsincos_cossin_piby4: + + movapd .Lsincosarray+0x30(%rip),%xmm4 # s4 + movapd .Lcossinarray+0x30(%rip),%xmm5 # s4 + movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2 + movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm3,%xmm7 # sincos term upper x2 for x3 + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4 + addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4 + addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2 + addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm0,%xmm0 # move high x4 for cos term + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin) + mulpd %xmm1,%xmm7 + + mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin) + movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + + movhlps %xmm2,%xmm12 # move high r for cos (cossin) + + + movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin) + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos) + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm11,%xmm5 # cos *x4 + mulsd %xmm0,%xmm8 # cos *x4 + mulsd %xmm7,%xmm9 # sin *x3 + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos) + + addsd %xmm10,%xmm4 # sin + x + + addsd %xmm11,%xmm9 # sin + x + + + subsd %xmm12,%xmm8 # cos+t + subsd %xmm3,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lsincos_sincos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd .Lcossinarray+0x30(%rip),%xmm4 # s4 + movapd .Lcossinarray+0x30(%rip),%xmm5 # s4 + movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2 + movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm2,%xmm6 # move x2 for x4 + movapd %xmm3,%xmm7 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3 + addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3 + addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1 + addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1 + + mulpd %xmm0,%xmm4 # x4(s4+x2s3) + mulpd %xmm11,%xmm5 # x4(s4+x2s3) + + mulpd %xmm10,%xmm6 # get low x3 for sin term + mulpd %xmm1,%xmm7 # get low x3 for sin term + movhlps %xmm6,%xmm6 # move low x2 
for x3 for sin term + movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term + + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms + mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm4,%xmm12 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm13 # xmm9 = sin , xmm5 = cos + + mulsd %xmm6,%xmm12 # sin *x3 + mulsd %xmm7,%xmm13 # sin *x3 + mulsd %xmm0,%xmm4 # cos *x4 + mulsd %xmm11,%xmm5 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0 + + movhlps %xmm10,%xmm0 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x for sin term + # Reverse 10 and 0 + + addsd %xmm0,%xmm12 # sin + x + addsd %xmm11,%xmm13 # sin + x + + subsd %xmm2,%xmm4 # cos+t + subsd %xmm3,%xmm5 # cos+t + + movlhps %xmm12,%xmm4 + movlhps %xmm13,%xmm5 + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lcossin_sincos_piby4: + + movapd .Lcossinarray+0x30(%rip),%xmm4 # s4 + movapd .Lsincosarray+0x30(%rip),%xmm5 # s4 + movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2 + movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos) + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4 + addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4 + addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2 + addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm11,%xmm11 # move high x4 for cos term + + movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin) + mulpd %xmm10,%xmm7 + + mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin) + movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos) + + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm3,%xmm12 # move high r for cos (cossin) + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos) + movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin) + + mulsd %xmm0,%xmm4 # cos *x4 + mulsd %xmm6,%xmm5 # sin *x3 + mulsd %xmm7,%xmm8 # sin *x3 + mulsd %xmm11,%xmm9 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + + movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos) + + subsd %xmm2,%xmm4 # cos-(-t) + subsd %xmm12,%xmm9 # cos-(-t) + + addsd %xmm11,%xmm8 # sin + x + addsd %xmm1,%xmm5 # sin + x + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrs4_cosf_cleanup +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: SIN +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: COS +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm0 # x2 ; SIN + movapd %xmm3,%xmm11 # x2 ; COS + movapd %xmm3,%xmm1 # copy of x2 for x4 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulpd 
.L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm0 # x4 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm3,%xmm1 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + + mulpd %xmm0,%xmm4 # x4(c3+x2c4) + mulpd %xmm1,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm1,%xmm5 # x4 * zc + + addpd %xmm10,%xmm4 # +x + subpd %xmm11,%xmm5 # +t + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lsinsin_coscos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: COS +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: SIN +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm0 # x2 ; COS + movapd %xmm3,%xmm11 # x2 ; SIN + movapd %xmm2,%xmm10 # copy of x2 for x4 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # s4 + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # s2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # s4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # s2*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4 + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # s1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm10,%xmm4 # x4(c3+x2c4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zs + + mulpd %xmm10,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x3 * zc + + subpd %xmm0,%xmm4 # +t + addpd %xmm1,%xmm5 # +x + + jmp .L__vrs4_cosf_cleanup +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_cossin_piby4: #Derive from cossin_coscos + movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos + movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin + movapd %xmm3,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcosarray(%rip),%xmm9 # c2+x2c1 + + movapd %xmm12,%xmm2 # upper=x4 + movsd %xmm6,%xmm2 # lower=x2 + mulsd %xmm10,%xmm2 # lower=x2*x + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # upper= x4 * zc + # lower=x3 * zs + mulpd %xmm13,%xmm5 # x4 * zc + + movlhps 
%xmm7,%xmm10 # + addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos + subpd %xmm11,%xmm5 # -(-t) + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lcoscos_sincos_piby4: #Derive from cossin_coscos + movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos + movapd %xmm3,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcosarray(%rip),%xmm9 # c2+x2c1 + + mulpd %xmm10,%xmm2 # upper=x3 for sin + mulsd %xmm10,%xmm2 # lower=x4 for cos + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # lower= x4 * zc + # upper= x3 * zs + mulpd %xmm13,%xmm5 # x4 * zc + + movsd %xmm7,%xmm10 + addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos + subpd %xmm11,%xmm5 # -(-t) + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lcossin_coscos_piby4: + movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos + movapd %xmm2,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd %xmm3,%xmm6 # lower x2 for x3 for sin + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4 + movapd .Lcosarray+0x10(%rip),%xmm8 # cs2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1 + + movapd %xmm13,%xmm3 # upper=x4 + movsd %xmm6,%xmm3 # lower x2 + mulsd %xmm1,%xmm3 # lower x2*x + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm12,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # upper= x4 * zc + # lower=x3 * zs + + movlhps %xmm7,%xmm1 + addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos + subpd %xmm11,%xmm4 # -(-t) + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lcossin_sinsin_piby4: # Derived from sincos_coscos + + movhlps %xmm3,%xmm0 # x2 + movapd %xmm3,%xmm7 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd 
%xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + movapd %xmm13,%xmm3 # upper x4 for cos + movsd %xmm7,%xmm3 # lower x2 for sin + mulsd %xmm1,%xmm3 # lower x3=x2*x for sin + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # upper=x4 * zc + # lower=x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm1,%xmm5 # +t upper, +x lower + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lsincos_coscos_piby4: + movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos + movapd %xmm2,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4 + movapd .Lcosarray+0x10(%rip),%xmm8 # cs2 + movapd .Lcossinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1 + + mulpd %xmm1,%xmm3 # upper=x3 for sin + mulsd %xmm1,%xmm3 # lower=x4 for cos + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm12,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # lower= x4 * zc + # upper= x3 * zs + + movsd %xmm7,%xmm1 + subpd %xmm11,%xmm4 # -(-t) + addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos + + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lsincos_sinsin_piby4: # Derived from sincos_coscos + + movsd %xmm3,%xmm0 # x2 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lcossinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm1,%xmm6 + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm6,%xmm11 # upper =t ; lower =x + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * 
zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm11,%xmm5 # +t lower, +x upper + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lsinsin_cossin_piby4: # Derived from sincos_coscos + + movhlps %xmm2,%xmm0 # x2 + movapd %xmm2,%xmm7 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + movapd %xmm12,%xmm2 # upper x4 for cos + movsd %xmm7,%xmm2 # lower x2 for sin + mulsd %xmm10,%xmm2 # lower x3=x2*x for sin + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # upper=x4 * zc + # lower=x3 * zs + + addpd %xmm1,%xmm5 # +x + addpd %xmm10,%xmm4 # +t upper, +x lower + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lsinsin_sincos_piby4: # Derived from sincos_coscos + + movsd %xmm2,%xmm0 # x2 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + movapd .Lcossinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm10,%xmm2 # upper x3 for sin + mulsd %xmm10,%xmm2 # lower x4 for cos + + movhlps %xmm10,%xmm6 + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm6,%xmm11 + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + addpd %xmm1,%xmm5 # +x + addpd %xmm11,%xmm4 # +t lower, +x upper + + jmp .L__vrs4_cosf_cleanup + +.align 16 +.Lsinsin_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + #x2 = x * x; + #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)))); + + #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4)); + + + movapd %xmm2,%xmm0 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulpd %xmm2,%xmm4 # c4*x2 
+ mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + + mulpd %xmm10,%xmm2 # x3 + mulpd %xmm1,%xmm3 # x3 + + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm0,%xmm4 # x4(c3+x2c4) + mulpd %xmm11,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + + jmp .L__vrs4_cosf_cleanup
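+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# Reader's note: every .L*_piby4 branch above evaluates the same two
+# reduced-argument kernels, merely permuted across the packed lanes.  The
+# following scalar C sketch of those kernels is an illustrative assumption
+# (not part of the build); s[] and c[] stand in for the .Lsinarray and
+# .Lcosarray coefficients defined elsewhere in this file.
+#
+#   /* sin(x), |x| <= pi/4:  x + x^3*((s1 + x2*s2) + x4*(s3 + x2*s4)) */
+#   static double sin_piby4(double x, const double s[4]) {
+#       double x2 = x * x, x3 = x2 * x, x4 = x2 * x2;
+#       double zs = (s[0] + x2 * s[1]) + x4 * (s[2] + x2 * s[3]);
+#       return x + x3 * zs;
+#   }
+#
+#   /* cos(x), |x| <= pi/4:  t + x^4*zc with t = 1 - 0.5*x^2; the code
+#      above forms -t = r - 1.0 and subtracts it rather than adding t  */
+#   static double cos_piby4(double x, const double c[4]) {
+#       double x2 = x * x, x4 = x2 * x2;
+#       double t  = 1.0 - 0.5 * x2;
+#       double zc = (c[0] + x2 * c[1]) + x4 * (c[2] + x2 * c[3]);
+#       return t + x4 * zc;
+#   }
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;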
diff --git a/src/gas/vrs4expf.S b/src/gas/vrs4expf.S new file mode 100644 index 0000000..b0e23aa --- /dev/null +++ b/src/gas/vrs4expf.S
@@ -0,0 +1,410 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# __vrs4_expf.s
+#
+# A vector implementation of the expf libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double-precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_expf(__m128 x);
+#
+# Computes e raised to the x power for 4 floats at a time.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_ux,0x10 # local storage for ux array
+.equ p_m,0x20 # local storage for m array
+.equ p_j,0x30 # local storage for j array
+.equ save_rbx,0x040 #qword
+.equ stack_size,0x48
+
+
+
+.globl __vrs4_expf
+ .type __vrs4_expf,@function
+__vrs4_expf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+
+ movaps %xmm0,p_ux(%rsp)
+ maxps .L__real_m8192(%rip),%xmm0 # protect against small input values
+
+
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 #
+ mulps %xmm0,%xmm2
+ xor %rax,%rax
+ minps .L__real_8192(%rip),%xmm2 # protect against large input values
+
+# /* Set n = nearest integer to r */
+ cvtps2dq %xmm2,%xmm3
+ lea .L__two_to_jby32_table(%rip),%rdi
+ cvtdq2ps %xmm3,%xmm1
+
+
+# r1 = x - n * logbaseof2_by_32_lead;
+ movaps .L__real_log2_by_32_head(%rip),%xmm2
+ mulps %xmm1,%xmm2
+ subps %xmm2,%xmm0 # r1 in xmm0,
+
+# r2 = - n * logbaseof2_by_32_tail;
+ mulps .L__real_log2_by_32_tail(%rip),%xmm1
+
+# j = n & 0x0000001f;
+ movdqa %xmm3,%xmm4
+ movdqa .L__int_mask_1f(%rip),%xmm2
+ pand %xmm4,%xmm2
+ movdqa %xmm2,p_j(%rsp)
+# f1 = two_to_jby32_lead_table[j];
+
+# *m = (n - j) / 32;
+ psubd %xmm2,%xmm4
+ psrad $5,%xmm4
+ movdqa %xmm4,p_m(%rsp)
+
+ movaps %xmm0,%xmm3
+ addps %xmm1,%xmm3
+
+ mov p_j(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j(%rsp) # save the f1 value
+
+# Step 2. Compute the polynomial.
+# q = r1 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 )))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ movaps %xmm3,%xmm4
+ movaps %xmm3,%xmm2 # x*x
+ mulps %xmm2,%xmm2
+ mulps .L__real_1_24(%rip),%xmm4 # /24
+
+ mov p_j+4(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+4(%rsp) # save the f1 value
+
+ addps .L__real_1_6(%rip),%xmm4 # +1/6
+
+ mulps %xmm2,%xmm3 # x^3
+ mov p_j+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+8(%rsp) # save the f1 value
+ mulps .L__real_half(%rip),%xmm2 # x^2/2
+ mulps %xmm3,%xmm4 # *x^3
+
+ mov p_j+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+12(%rsp) # save the f1 value
+ addps %xmm4,%xmm1 # +r2
+
+ addps %xmm2,%xmm1 # + x^2/2
+ addps %xmm1,%xmm0 # +r1
+
+# deal with infinite or denormal results
+ movdqa p_m(%rsp),%xmm1
+ movdqa p_m(%rsp),%xmm2
+ pcmpgtd .L__int_127(%rip),%xmm2
+ pminsw .L__int_128(%rip),%xmm1 # ceil at 128
+ movmskps %xmm2,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm1 # add bias
+
+# *z2 = f2 + ((f1 + f2) * q); in single precision the f2 term is dropped,
+# so the code computes z2 = f1 + f1*q
+ mulps p_j(%rsp),%xmm0 # * f1
+ addps p_j(%rsp),%xmm0 # + f1
+ jnz .L__exp_largef
+.L__check1:
+
+ pxor %xmm2,%xmm2 # floor at 0
+ pmaxsw %xmm2,%xmm1
+
+ pslld $23,%xmm1 # build 2^n
+
+ movaps %xmm1,%xmm2
+
+
+# check for infinity or nan
+ movaps p_ux(%rsp),%xmm1
+ andps .L__real_infinity(%rip),%xmm1
+ cmpps $0,.L__real_infinity(%rip),%xmm1
+ movmskps %xmm1,%eax
+ test $0xf,%eax
+
+
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# Step 3. Reconstitute.
+
+ mulps %xmm2,%xmm0 # result *= 2^n
+
+# we'd like to avoid a branch, and can use cmp's and and's to
+# eliminate them. But it adds cycles for normal cases
+# to handle events that are supposed to be exceptions.
+# Using this branch with the
+# check above results in faster code for the normal cases.
+# And branch mispredict penalties should only come into
+# play for nans and infinities.
+ jnz .L__exp_naninf
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# deal with nans and infinities
+
+.L__exp_naninf:
+ movaps %xmm0,p_temp(%rsp) # save the computed values
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .__Lni2
+ mov p_ux(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp(%rsp) # save the new result
+.__Lni2:
+ test $2,%ecx # second value?
+ jz .__Lni3
+ mov p_ux+4(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+4(%rsp) # save the new result
+.__Lni3:
+ test $4,%ecx # third value?
+ jz .__Lni4
+ mov p_ux+8(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+8(%rsp) # save the new result
+.__Lni4:
+ test $8,%ecx # fourth value?
+ jz .__Lnie
+ mov p_ux+12(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+12(%rsp) # save the new result
+.__Lnie:
+ movaps p_temp(%rsp),%xmm0 # get the answers
+ jmp .L__final_check
+
+
+# a simple subroutine to check a scalar input value for infinity
+# or NaN and return the correct result
+# expects input in edx, and returns value in edx. Destroys eax.
+.L__naninf: + mov $0x0007FFFFF,%eax + test %eax,%edx + jnz .L__enan # jump if mantissa not zero, so it's a NaN +# inf + mov %edx,%eax + rcl $1,%eax + jnc .L__r # exp(+inf) = inf + xor %edx,%edx # exp(-inf) = 0 + jmp .L__r + +#NaN +.L__enan: + mov $0x000400000,%eax # convert to quiet + or %eax,%edx +.L__r: + ret + + .align 16 +# deal with m > 127. In some instances, rounding during calculations +# can result in infinity when it shouldn't. For these cases, we scale +# m down, and scale the mantissa up. + +.L__exp_largef: + movdqa %xmm0,p_temp(%rsp) # save the mantissa portion + movdqa %xmm1,p_m(%rsp) # save the exponent portion + mov %eax,%ecx # save the error mask + test $1,%ecx # first value? + jz .L__Lf2 + mov p_m(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m(%rsp) # save the exponent + movss p_temp(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_temp(%rsp) # save the mantissa +.L__Lf2: + test $2,%ecx # second value? + jz .L__Lf3 + mov p_m+4(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+4(%rsp) # save the exponent + movss p_temp+4(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_temp+4(%rsp) # save the mantissa +.L__Lf3: + test $4,%ecx # third value? + jz .L__Lf4 + mov p_m+8(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+8(%rsp) # save the exponent + movss p_temp+8(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_temp+8(%rsp) # save the mantissa +.L__Lf4: + test $8,%ecx # fourth value? + jz .L__Lfe + mov p_m+12(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+12(%rsp) # save the exponent + movss p_temp+12(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_temp+12(%rsp) # save the mantissa +.L__Lfe: + movaps p_temp(%rsp),%xmm0 # restore the mantissa portion back + movdqa p_m(%rsp),%xmm1 # restore the exponent portion + jmp .L__check1 + + .data + .align 64 + +.L__real_half: .long 0x3f000000 # 1/2 + .long 0x3f000000 + .long 0x3f000000 + .long 0x3f000000 + +.L__real_two: .long 0x40000000 # 2 + .long 0x40000000 + .long 0x40000000 + .long 0x40000000 + +.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers + .long 0x46000000 + .long 0x46000000 + .long 0x46000000 +.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers + .long 0xC6000000 + .long 0xC6000000 + .long 0xC6000000 + +.L__real_thirtytwo_by_log2: .long 0x4238AA3B # thirtytwo_by_log2 + .long 0x4238AA3B + .long 0x4238AA3B + .long 0x4238AA3B + +.L__real_log2_by_32: .long 0x3CB17218 # log2_by_32 + .long 0x3CB17218 + .long 0x3CB17218 + .long 0x3CB17218 + +.L__real_log2_by_32_head: .long 0x3CB17000 # log2_by_32 + .long 0x3CB17000 + .long 0x3CB17000 + .long 0x3CB17000 + +.L__real_log2_by_32_tail: .long 0xB585FDF4 # log2_by_32 + .long 0xB585FDF4 + .long 0xB585FDF4 + .long 0xB585FDF4 + +.L__real_1_6: .long 0x3E2AAAAB # 0.16666666666 used in polynomial + .long 0x3E2AAAAB + .long 0x3E2AAAAB + .long 0x3E2AAAAB + +.L__real_1_24: .long 0x3D2AAAAB # 0.041666668 used in polynomial + .long 0x3D2AAAAB + .long 0x3D2AAAAB + .long 0x3D2AAAAB + +.L__real_infinity: .long 0x7f800000 # infinity + .long 0x7f800000 + .long 0x7f800000 + .long 0x7f800000 +.L__int_mask_1f: .long 0x00000001f + .long 0x00000001f + .long 0x00000001f + .long 0x00000001f +.L__int_128: .long 0x000000080 + .long 0x000000080 + .long 0x000000080 + .long 
0x000000080 +.L__int_127: .long 0x00000007f + .long 0x00000007f + .long 0x00000007f + .long 0x00000007f + + +.L__two_to_jby32_table: + .long 0x3F800000 # 1 + .long 0x3F82CD87 # 1.0218972 + .long 0x3F85AAC3 # 1.0442737 + .long 0x3F88980F # 1.0671405 + .long 0x3F8B95C2 # 1.0905077 + .long 0x3F8EA43A # 1.1143868 + .long 0x3F91C3D3 # 1.1387886 + .long 0x3F94F4F0 # 1.1637249 + .long 0x3F9837F0 # 1.1892071 + .long 0x3F9B8D3A # 1.2152474 + .long 0x3F9EF532 # 1.2418578 + .long 0x3FA27043 # 1.269051 + .long 0x3FA5FED7 # 1.2968396 + .long 0x3FA9A15B # 1.3252367 + .long 0x3FAD583F # 1.3542556 + .long 0x3FB123F6 # 1.3839099 + .long 0x3FB504F3 # 1.4142135 + .long 0x3FB8FBAF # 1.4451808 + .long 0x3FBD08A4 # 1.4768262 + .long 0x3FC12C4D # 1.5091645 + .long 0x3FC5672A # 1.5422108 + .long 0x3FC9B9BE # 1.5759809 + .long 0x3FCE248C # 1.6104903 + .long 0x3FD2A81E # 1.6457555 + .long 0x3FD744FD # 1.6817929 + .long 0x3FDBFBB8 # 1.7186193 + .long 0x3FE0CCDF # 1.7562522 + .long 0x3FE5B907 # 1.7947091 + .long 0x3FEAC0C7 # 1.8340081 + .long 0x3FEFE4BA # 1.8741677 + .long 0x3FF5257D # 1.9152066 + .long 0x3FFA83B3 # 1.9571441 + .long 0 # for alignment +
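+
+# Reader's note: the fast path above condenses to the following scalar C
+# sketch (an illustrative assumption, not part of the build; two_to_jby32[]
+# mirrors .L__two_to_jby32_table, and the +-8192 clamps and the NaN/inf and
+# large-m fixups are omitted):
+#
+#   #include <math.h>
+#   extern const float two_to_jby32[32];
+#
+#   static float vrs4_expf_lane(float x) {
+#       const float thirtytwo_by_log2 = 46.166240f;       /* 0x4238AA3B */
+#       const float log2_by_32_lead   = 0.0216598510742f; /* 0x3CB17000 */
+#       const float log2_by_32_tail   = -9.9832e-07f;     /* 0xB585FDF4, negative */
+#       int   n  = (int)nearbyintf(x * thirtytwo_by_log2);
+#       float r1 = x - n * log2_by_32_lead;           /* head of x - n*ln2/32  */
+#       float r2 = n * log2_by_32_tail;               /* tail; sign baked in   */
+#       int   j  = n & 0x1f;
+#       int   m  = (n - j) >> 5;
+#       float r  = r1 + r2;
+#       float q  = r + r*r*(0.5f + r*(1.0f/6.0f + r*(1.0f/24.0f)));
+#       float f1 = two_to_jby32[j];
+#       return ldexpf(f1 + f1 * q, m);  /* the asm builds the 2^m bits directly */
+#   }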
diff --git a/src/gas/vrs4log10f.S b/src/gas/vrs4log10f.S new file mode 100644 index 0000000..d6d9ac8 --- /dev/null +++ b/src/gas/vrs4log10f.S
@@ -0,0 +1,646 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4log10f.s
+#
+# A vector implementation of the log10f libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double-precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_log10f(__m128 x);
+#
+# Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+
+.equ stack_size,0x088
+
+
+
+.globl __vrs4_log10f
+ .type __vrs4_log10f,@function
+__vrs4_log10f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1.
*/ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + lea .L__np_ln_tail_table(%rip),%rdx + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + + addps %xmm4,%xmm1 # poly + +# recombine + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f1 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f1 value + + addps p_q(%rsp),%xmm1 #z2 +=q + + movaps p_z1(%rsp),%xmm0 # z1 values + + mulps %xmm6,%xmm2 + addps %xmm2,%xmm0 #r1 + movaps %xmm0,%xmm2 + mulps .L__real_log2_tail(%rip),%xmm6 + addps %xmm6,%xmm1 #r2 + movaps %xmm1,%xmm3 + +# logef to log10f + mulps .L__real_log10e_tail(%rip),%xmm1 + mulps .L__real_log10e_tail(%rip),%xmm0 + mulps .L__real_log10e_lead(%rip),%xmm3 + mulps .L__real_log10e_lead(%rip),%xmm2 + addps %xmm1,%xmm0 + addps %xmm3,%xmm0 + addps %xmm2,%xmm0 +# addps %xmm1,%xmm0 + +# check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__finish
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+ .data
+ .align 64
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: .quad 0x04000000040000000 # 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x03EDE00003EDE0000
+.L__real_log10e_tail:
.quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319 + .quad 0x03A37B1523A37B152 + + +.L__mask_lower: .quad 0x0ffff0000ffff0000 # + .quad 0x0ffff0000ffff0000 + +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01 + .quad 
0x3FE52A2D265BC5AB # 6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + .long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + .long 0x387BE3CD # 0.000060055219 16 + 
.long 0x3803B715 # 0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + .long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment +
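+
+# Reader's note: the fast path above condenses to the following scalar C
+# sketch (an illustrative assumption, not part of the build; ln_lead[] and
+# ln_tail[] mirror the 65-entry tables above, and the zero/negative/NaN/inf
+# and near-one special paths are omitted):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#   extern const float ln_lead[65], ln_tail[65];
+#
+#   static float vrs4_log10f_lane(float x) {        /* x positive and normal */
+#       const float cb1 = 8.3333333e-02f, cb2 = 1.2500000e-02f, cb3 = 2.2321981e-03f;
+#       const float log2_lead   = 0.693115234375f, log2_tail   = 3.1946183e-05f;
+#       const float log10e_lead = 0.43359375f,     log10e_tail = 7.007319e-04f;
+#       uint32_t ux; memcpy(&ux, &x, sizeof ux);
+#       float    xexp = (float)((int)(ux >> 23) - 127);
+#       uint32_t v  = (ux & 0x007FFFFFu) >> 16;      /* top 7 mantissa bits   */
+#       uint32_t j  = 0x40 + (v >> 1) + (v & 1);     /* index, 64 <= j <= 128 */
+#       float    f1 = (float)j * 0.0078125f;         /* f1 = j/128            */
+#       uint32_t uf = (ux & 0x007FFFFFu) | 0x3F000000u;
+#       float f; memcpy(&f, &uf, sizeof f);          /* f in [0.5, 1)         */
+#       float f2 = f - f1;
+#       float u  = f2 / (f1 + 0.5f * f2);            /* u = 2*f2/(2*f1 + f2)  */
+#       float u2 = u * u;
+#       float z2 = u + u*u2*(cb1 + u2*(cb2 + u2*cb3));      /* ln(1 + f2/f1)  */
+#       float r1 = ln_lead[j - 64] + xexp * log2_lead;      /* ln(x), head    */
+#       float r2 = (ln_tail[j - 64] + z2) + xexp * log2_tail;  /* ln(x), tail */
+#       return (r1 + r2) * log10e_tail + r2 * log10e_lead + r1 * log10e_lead;
+#   }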
diff --git a/src/gas/vrs4log2f.S b/src/gas/vrs4log2f.S new file mode 100644 index 0000000..05185b2 --- /dev/null +++ b/src/gas/vrs4log2f.S
@@ -0,0 +1,639 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4log2f.s
+#
+# A vector implementation of the log2f libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double-precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_log2f(__m128 x);
+#
+# Computes the base-2 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+
+.equ stack_size,0x088
+
+
+
+.globl __vrs4_log2f
+ .type __vrs4_log2f,@function
+__vrs4_log2f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check 2 as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1.
*/ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + lea .L__np_ln_tail_table(%rip),%rdx + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2_lead(%rip),%xmm2 + movaps .L__real_log2e_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + movaps .L__real_log2e_tail(%rip),%xmm3 + addps %xmm4,%xmm1 # poly + +# recombine + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f1 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f1 value + + addps p_q(%rsp),%xmm1 #z2 +=q + movaps %xmm1,%xmm4 #z2 copy + movaps p_z1(%rsp),%xmm0 # z1 values + movaps %xmm0,%xmm5 #z1 copy + mulps %xmm2,%xmm5 #z1*log2e_lead + mulps %xmm2,%xmm1 #z2*log2e_lead + mulps %xmm3,%xmm4 #z2*log2e_tail + mulps %xmm3,%xmm0 #z1*log2e_tail + addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp + addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail + addps %xmm1,%xmm0 #r2 +#return r1+r2 + addps %xmm5,%xmm0 # r1+ r2 +# check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm1
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__finish
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d + test $0x0f,%r9d + jz .L__zn2 + + movdqa %xmm1,%xmm3 + andnps %xmm0,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + +.L__zn2: +# check for NaNs + movaps p_x(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x(%rsp),%xmm1 # isolate the NaNs + pand %xmm4,%xmm1 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm1,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm1 + andnps %xmm0,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f2 + +# handle only +inf log(+inf) = inf +.L__log_inf: + movdqa %xmm3,%xmm1 + andnps %xmm0,%xmm3 # keep the non-error values + andps p_x(%rsp),%xmm1 # setup the +inf values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + jmp .L__f3 + + + .data + .align 64 + +.L__real_zero: .quad 0x00000000000000000 # 1.0 + .quad 0x00000000000000000 +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 1.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantipsa bits + .quad 0x0007FFFFF007FFFFF +.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */ + .quad 0x03c0000003c000000 +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f +.L__mask_040: .quad 0x00000004000000040 # + .quad 0x00000004000000040 +.L__mask_001: .quad 0x00000000100000001 # + .quad 0x00000000100000001 + + +.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03 + .quad 0x03CF5C28F3CF5C28F + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03 + .quad 0x03B1249183B124918 +.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04 + .quad 0x039E401A639E401A6 +.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03 + .quad 0x03B124A123B124A12 +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 +.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000 + .quad 0x03FB800003FB80000 +.L__real_log2e_tail: .quad 
0x03BAA3B293BAA3B29 # 0.0051950408889633 + .quad 0x03BAA3B293BAA3B29 + +.L__mask_lower: .quad 0x0ffff0000ffff0000 # + .quad 0x0ffff0000ffff0000 + +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01 + .quad 0x3FE52A2D265BC5AB # 
6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + .long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + .long 0x387BE3CD # 0.000060055219 16 + .long 0x3803B715 # 
0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + .long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment +
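The recombination step in the routine above keeps its scaling constant split into a "lead" value with few significand bits plus a small "tail" correction, so each product is nearly exact; since the natural-log pieces z1/z2 are scaled by log2(e) and the float exponent is added directly, this appears to be a base-2 variant of the log routine. A minimal scalar C sketch of that recombination (the function name is illustrative, not from the source):

    /* Sketch only: z1 is the table lead value, z2 the polynomial-plus-tail
       sum, xexp the unbiased exponent, mirroring the xmm recombination. */
    static float recombine_log2(float z1, float z2, float xexp)
    {
        const float log2e_lead = 1.4375f;             /* 0x3FB80000 */
        const float log2e_tail = 0.0051950408889633f; /* 0x3BAA3B29 */
        float r1 = z1 * log2e_lead + xexp;            /* high-order part */
        float r2 = z1 * log2e_tail + z2 * log2e_tail
                 + z2 * log2e_lead;                   /* low-order part */
        return r1 + r2;
    }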
diff --git a/src/gas/vrs4logf.S b/src/gas/vrs4logf.S new file mode 100644 index 0000000..4a39f1c --- /dev/null +++ b/src/gas/vrs4logf.S
@@ -0,0 +1,614 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrs4logf.s +# +# A vector implementation of the logf libm function. +# This routine implemented in single precision. It is slightly +# less accurate than the double precision version, but it will +# be better for vectorizing. +# +# Prototype: +# +# __m128 __vrs4_logf(__m128 x); +# +# Computes the natural log of x. +# Returns proper C99 values, but may not raise status flags properly. +# Less than 1 ulp of error. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + .text + .align 16 + .p2align 4,,15 + +# define local variable storage offsets +.equ p_x,0 # save x +.equ p_idx,0x010 # xmmword index +.equ p_z1,0x020 # xmmword index +.equ p_q,0x030 # xmmword index +.equ p_corr,0x040 # xmmword index +.equ p_omask,0x050 # xmmword index +.equ save_xmm6,0x060 # +.equ save_rbx,0x070 # + +.equ stack_size,0x088 + + + +.globl __vrs4_logf + .type __vrs4_logf,@function +__vrs4_logf: + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx + +# check e as a special case + movdqa %xmm0,p_x(%rsp) # save x + movdqa %xmm0,%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movmskps %xmm2,%r9d + +# +# compute the index into the log tables +# + movdqa %xmm0,%xmm3 + movaps %xmm0,%xmm1 + psrld $23,%xmm3 + subps .L__real_one(%rip),%xmm1 + psubd .L__mask_127(%rip),%xmm3 + cvtdq2ps %xmm3,%xmm6 # xexp + + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + xor %r8,%r8 + movdqa %xmm3,%xmm2 + movaps .L__real_half(%rip),%xmm5 # .5 + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. 
*/ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + lea .L__np_ln_tail_table(%rip),%rdx + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + + addps %xmm4,%xmm1 # poly + +# recombine + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f1 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f1 value + + addps p_q(%rsp),%xmm1 #z2 +=q + + movaps p_z1(%rsp),%xmm0 # z1 values + + mulps %xmm6,%xmm2 + addps %xmm2,%xmm0 #r1 + mulps .L__real_log2_tail(%rip),%xmm6 + addps %xmm6,%xmm1 #r2 + addps %xmm1,%xmm0 + +# check for e + test $0x0f,%r9d + jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm1,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg + +.L__f2: +## if +inf + movaps p_x(%rsp),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__log_inf +.L__f3: + + movaps p_x(%rsp),%xmm3 + subps .L__real_one(%rip),%xmm3 + andps .L__real_notsign(%rip),%xmm3 + cmpps $2,.L__real_threshold(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__near_one + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + +.L__vlogf_e: + movdqa p_x(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm0,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + jmp .L__f1 + + .align 16 +.L__near_one: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm3,p_omask(%rsp) # save ones mask + movaps p_x(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r +# u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm1 + divps %xmm2,%xmm1 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm1,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction +# u = u + u; + addps %xmm1,%xmm1 #u + movaps %xmm1,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm1,%xmm5 # Cu + movaps %xmm1,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm1 + mulps %xmm1,%xmm1 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm1 #u6(Cu+Du3) + addps %xmm1,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + +# return r + r2; + addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm0,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + + jmp .L__finish + +# we have a zero, a negative number, or both. +# the mask is already in %xmm1. NaNs are also picked up here, along with -inf. +.L__z_or_neg: +# deal with negatives first + movdqa %xmm1,%xmm3 + andps %xmm0,%xmm3 # keep the non-error values + andnps .L__real_nan(%rip),%xmm1 # setup the nan values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace +# check for +/- 0 + xorps %xmm1,%xmm1 + cmpps $0,p_x(%rsp),%xmm1 # 0 ?. 
+ movmskps %xmm1,%r9d + test $0x0f,%r9d + jz .L__zn2 + + movdqa %xmm1,%xmm3 + andnps %xmm0,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + +.L__zn2: +# check for NaNs + movaps p_x(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x(%rsp),%xmm1 # isolate the NaNs + pand %xmm4,%xmm1 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm1,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm1 + andnps %xmm0,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f2 + +# handle only +inf log(+inf) = inf +.L__log_inf: + movdqa %xmm3,%xmm1 + andnps %xmm0,%xmm3 # keep the non-error values + andps p_x(%rsp),%xmm1 # setup the +inf values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + jmp .L__f3 + + + .data + .align 64 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 2.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits + .quad 0x0007FFFFF007FFFFF +.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */ + .quad 0x03c0000003c000000 +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f +.L__mask_040: .quad 0x00000004000000040 # + .quad 0x00000004000000040 +.L__mask_001: .quad 0x00000000100000001 # + .quad 0x00000000100000001 + + +.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03 + .quad 0x03CF5C28F3CF5C28F + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03 + .quad 0x03B1249183B124918 +.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04 + .quad 0x039E401A639E401A6 +.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03 + .quad 0x03B124A123B124A12 +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 + + +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 # 
1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01 + .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 
+ .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + .long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + .long 0x387BE3CD # 0.000060055219 16 + .long 0x3803B715 # 0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + 
.long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment +
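The routine above reduces each lane to a 64-entry table index plus a small ratio u, evaluates a short series for the residual log, and recombines with the exponent through split log2 constants. A scalar C sketch of the same flow, assuming a normal positive input; logf_sketch, ln_lead and ln_tail are illustrative names standing in for the 65-entry .L__np_ln_lead_table/.L__np_ln_tail_table data above:

    #include <stdint.h>
    #include <string.h>

    extern const float ln_lead[65], ln_tail[65];  /* tables from above */

    float logf_sketch(float x)      /* assumes x normal and > 0 */
    {
        uint32_t ux; memcpy(&ux, &x, sizeof ux);
        float xexp = (float)((int)(ux >> 23) - 127);
        uint32_t mant = ux & 0x007FFFFFu;

        /* index in [64,128]: top mantissa bits, rounded to nearest */
        uint32_t index = (mant >> 17) + ((mant >> 16) & 1u) + 0x40u;

        uint32_t uf = mant | 0x3F000000u;   /* f in [0.5, 1.0) */
        float f; memcpy(&f, &uf, sizeof f);

        float f1 = (float)index * 0.0078125f;   /* index/128 */
        float f2 = f - f1;
        float u  = f2 / (f1 + 0.5f * f2);   /* ln(f/f1) = 2*atanh(u/2) */

        float v = u * u;                    /* series: u + u^3/12 + ... */
        float poly = u + u * v * (8.33333333333333593622e-02f
                        + v * (1.24999999978138668903e-02f
                        + v *  2.23219810758559851206e-03f));

        float r1 = xexp * 0.693115234375f + ln_lead[index - 64];
        float r2 = poly + ln_tail[index - 64] + xexp * 3.1946183e-05f;
        return r1 + r2;
    }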
diff --git a/src/gas/vrs4powf.S b/src/gas/vrs4powf.S new file mode 100644 index 0000000..42b005d --- /dev/null +++ b/src/gas/vrs4powf.S
@@ -0,0 +1,623 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrs4powf.s +# +# A vector implementation of the powf libm function. +# +# Prototype: +# +# __m128 __vrs4_powf(__m128 x,__m128 y); +# +# Computes x raised to the y power. Returns proper C99 values. +# Uses new tuned fastlog/fastexp. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + +# define local variable storage offsets +.equ p_temp,0x00 # xmmword +.equ p_negateres,0x10 # qword + +.equ p_xexp,0x20 # qword + +.equ p_ux,0x030 # storage for X +.equ p_uy,0x040 # storage for Y + +.equ p_ax,0x050 # absolute x +.equ p_sx,0x060 # sign of x's + +.equ p_ay,0x070 # absolute y +.equ p_yexp,0x080 # unbiased exponent of y + +.equ p_inty,0x090 # integer y indicators +.equ save_rbx,0x0A0 # + +.equ stack_size,0x0B8 # allocate 40h more than + # we need to avoid bank conflicts + + + + .text + .align 16 + .p2align 4,,15 +.globl __vrs4_powf + .type __vrs4_powf,@function +__vrs4_powf: + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx + + movaps %xmm0,p_ux(%rsp) # save x + movaps %xmm1,p_uy(%rsp) # save y + + movaps %xmm0,%xmm2 + andps .L__mask_nsign(%rip),%xmm0 # get abs x + andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits + movaps %xmm0,p_ax(%rsp) # save them + movaps %xmm2,p_sx(%rsp) # save them +# convert all four x's to double + cvtps2pd p_ax(%rsp),%xmm0 + cvtps2pd p_ax+8(%rsp),%xmm1 +# +# classify y +# vector 32 bit integer method 25 cycles to here +# /* See whether y is an integer. +# inty = 0 means not an integer. +# inty = 1 means odd integer. +# inty = 2 means even integer. 
+# */ + movdqa p_uy(%rsp),%xmm4 + pxor %xmm3,%xmm3 + pand .L__mask_nsign(%rip),%xmm4 # get abs y in integer format + movdqa %xmm4,p_ay(%rsp) # save it + +# see if the number is less than 1.0 + psrld $23,%xmm4 #>> EXPSHIFTBITS_SP32 + + psubd .L__mask_127(%rip),%xmm4 # yexp, unbiased exponent + movdqa %xmm4,p_yexp(%rsp) # save it + paddd .L__mask_1(%rip),%xmm4 # yexp+1 + pcmpgtd %xmm3,%xmm4 # 0 if exp less than 126 (2^0) (y < 1.0), else FFs +# xmm4 is ffs if abs(y) >=1.0, else 0 + +# see if the mantissa has fractional bits +#build mask for mantissa + movdqa .L__mask_23(%rip),%xmm2 + psubd p_yexp(%rsp),%xmm2 # 24-yexp + pmaxsw %xmm3,%xmm2 # no shift counts less than 0 + movdqa %xmm2,p_temp(%rsp) # save the shift counts +# create mask for all four values +# SSE can't do individual shifts, so we have to do each one separately + mov p_temp(%rsp),%rcx + mov $1,%rbx + shl %cl,%ebx #1 << (24 - yexp) + shr $32,%rcx + mov $1,%eax + shl %cl,%eax #1 << (24 - yexp) + shl $32,%rax + add %rax,%rbx + mov %rbx,p_temp(%rsp) + mov p_temp+8(%rsp),%rcx + mov $1,%rbx + shl %cl,%ebx #1 << (24 - yexp) + shr $32,%rcx + mov $1,%eax + shl %cl,%eax #1 << (24 - yexp) + shl $32,%rax + add %rbx,%rax + mov %rax,p_temp+8(%rsp) + movdqa p_temp(%rsp),%xmm5 + psubd .L__mask_1(%rip),%xmm5 #= mask = (1 << (24 - yexp)) - 1 + +# now use the mask to see if there are any fractional bits + movdqa p_uy(%rsp),%xmm2 # get uy + pand %xmm5,%xmm2 # uy & mask + pcmpeqd %xmm3,%xmm2 # 0 if not zero (y has fractional mantissa bits), else FFs + pand %xmm4,%xmm2 # either 0s or ff +# xmm2 now accounts for y< 1.0 or y>=1.0 and y has fractional mantissa bits, +# it has the value 0 if we know it's non-integer or ff if integer. + +# now see if it's even or odd. + +## if yexp > 24, then it has to be even + movdqa .L__mask_24(%rip),%xmm4 + psubd p_yexp(%rsp),%xmm4 # 24-yexp + paddd .L__mask_1(%rip),%xmm5 # mask+1 = least significant integer bit + pcmpgtd %xmm3,%xmm4 ## if 0, then must be even, else ff's + + pand %xmm4,%xmm5 # set the integer bit mask to zero if yexp>24 + paddd .L__mask_2(%rip),%xmm4 + por .L__mask_2(%rip),%xmm4 + pand %xmm2,%xmm4 # result can be 0, 2, or 3 + +# now for integer numbers, see if odd or even + pand .L__mask_mant(%rip),%xmm5 # mask out exponent bits + movdqa .L__float_one(%rip),%xmm2 + pand p_uy(%rsp),%xmm5 # & uy -> even or odd + pcmpeqd p_ay(%rsp),%xmm2 # is ay equal to 1, ff's if so, then it's odd + pand .L__mask_nsign(%rip),%xmm2 # strip the sign bit so the gt comparison works. + por %xmm2,%xmm5 + pcmpgtd %xmm3,%xmm5 ## if odd then ff's, else 0's for even + paddd .L__mask_2(%rip),%xmm5 # gives us 2 for even, 1 for odd + pand %xmm5,%xmm4 + + movdqa %xmm4,p_inty(%rsp) # save inty +# +# do more x special case checking +# + movdqa %xmm4,%xmm5 + pcmpeqd %xmm3,%xmm5 # is not an integer? ff's if so + pand .L__mask_NaN(%rip),%xmm5 # these values will be NaNs, if x<0 + movdqa %xmm4,%xmm2 + pcmpeqd .L__mask_1(%rip),%xmm2 # is it odd? ff's if so + pand .L__mask_sign(%rip),%xmm2 # these values will get their sign bit set + por %xmm2,%xmm5 + + pcmpeqd p_sx(%rsp),%xmm3 ## if the signs are set + pandn %xmm5,%xmm3 # then negateres gets the values as shown below + movdqa %xmm3,p_negateres(%rsp) # save negateres + +# /* p_negateres now means the following. +# ** 7FC00000 means x<0, y not an integer, return NaN. +# ** 80000000 means x<0, y is odd integer, so set the sign bit. +# ** 0 means even integer, and/or x>=0. 
+# */ + + +# **** Here starts the main calculations **** +# The algorithm used is x**y = exp(y*log(x)) +# Extra precision is required in intermediate steps to meet the 1ulp requirement +# +# log(x) calculation + call __vrd4_log@PLT # get the double precision log value + # for all four x's +# y* logx +# convert all four y's to double + lea p_uy(%rsp),%rdx # get pointer to y + cvtps2pd (%rdx),%xmm2 + cvtps2pd 8(%rdx),%xmm3 + +# /* just multiply by y */ + mulpd %xmm2,%xmm0 + mulpd %xmm3,%xmm1 + +# /* The following code computes r = exp(w) */ + call __vrd4_exp@PLT # get the double exp value + # for all four y*log(x)'s +# +# convert all four results to double + cvtpd2ps %xmm0,%xmm0 + cvtpd2ps %xmm1,%xmm1 + movlhps %xmm1,%xmm0 + +# perform special case and error checking on input values + +# special case checking is done first in the scalar version since +# it allows for early fast returns. But for vectors, we consider them +# to be rare, so early returns are not necessary. So we first compute +# the x**y values, and then check for special cases. + +# we do some of the checking in reverse order of the scalar version. + lea p_uy(%rsp),%rdx # get pointer to y +# apply the negate result flags + orps p_negateres(%rsp),%xmm0 # get negateres + +## if y is infinite or so large that the result would overflow or underflow + movdqa p_ay(%rsp),%xmm4 + cmpps $5,.L__mask_ly(%rip),%xmm4 # y not less than large value, ffs if so. + movmskps %xmm4,%edx + test $0x0f,%edx + jnz .Ly_large +.Lrnsx3: + +## if x is infinite + movdqa p_ax(%rsp),%xmm4 + cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so. + movmskps %xmm4,%edx + test $0x0f,%edx + jnz .Lx_infinite +.Lrnsx1: +## if x is zero + xorps %xmm4,%xmm4 + cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so. + movmskps %xmm4,%edx + test $0x0f,%edx + jnz .Lx_zero +.Lrnsx2: +## if y is NAN + lea p_uy(%rsp),%rdx # get pointer to y + movdqa (%rdx),%xmm4 # get y + cmpps $4,%xmm4,%xmm4 # a compare not equal of y to itself should + # be false, unless y is a NaN. ff's if NaN. + movmskps %xmm4,%ecx + test $0x0f,%ecx + jnz .Ly_NaN +.Lrnsx4: +## if x is NAN + lea p_ux(%rsp),%rdx # get pointer to x + movdqa (%rdx),%xmm4 # get x + cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should + # be false, unless x is a NaN. ff's if NaN. + movmskps %xmm4,%ecx + test $0x0f,%ecx + jnz .Lx_NaN +.Lrnsx5: + +## if |y| == 0 then return 1 + movdqa .L__float_one(%rip),%xmm3 # one + xorps %xmm2,%xmm2 + cmpps $4,p_ay(%rsp),%xmm2 # not equal to 0.0?, ffs if not equal. + andps %xmm2,%xmm0 # keep the others + andnps %xmm3,%xmm2 # mask for ones + orps %xmm2,%xmm0 +## if x == +1, return +1 for all x + lea p_ux(%rsp),%rdx # get pointer to x + movdqa %xmm3,%xmm2 + cmpps $4,(%rdx),%xmm2 # not equal to +1.0?, ffs if not equal. + andps %xmm2,%xmm0 # keep the others + andnps %xmm3,%xmm2 # mask for ones + orps %xmm2,%xmm0 + +.L__powf_cleanup2: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + .align 16 +# y is a NaN. +.Ly_NaN: + lea p_uy(%rsp),%rdx # get pointer to y + movdqa (%rdx),%xmm4 # get y + movdqa %xmm4,%xmm3 + movdqa %xmm4,%xmm5 + movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits + cmpps $0,%xmm4,%xmm4 # a compare equal of y to itself should + # be true, unless y is a NaN. 0's if NaN. + cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN. 
+ andps %xmm4,%xmm0 # keep the other results + andps %xmm3,%xmm2 # get just the right signalling bits + andps %xmm5,%xmm3 # mask for the NaNs + orps %xmm2,%xmm3 # convert to QNaNs + orps %xmm3,%xmm0 # combine + jmp .Lrnsx4 + +# x is a NaN. +.Lx_NaN: + lea p_ux(%rsp),%rcx # get pointer to x + movdqa (%rcx),%xmm4 # get x + movdqa %xmm4,%xmm3 + movdqa %xmm4,%xmm5 + movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits + cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should + # be true, unless x is a NaN. 0's if NaN. + cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN. + andps %xmm4,%xmm0 # keep the other results + andps %xmm3,%xmm2 # get just the right signalling bits + andps %xmm5,%xmm3 # mask for the NaNs + orps %xmm2,%xmm3 # convert to QNaNs + orps %xmm3,%xmm0 # combine + jmp .Lrnsx5 + +# * y is infinite or so large that the result would +# overflow or underflow. +.Ly_large: + movdqa %xmm0,p_temp(%rsp) + + test $1,%edx + jz .Lylrga + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov (%rcx),%eax + mov (%rbx),%ebx + mov p_inty(%rsp),%ecx + sub $8,%rsp + call .Lnp_special6 # call the handler for one value + add $8,%rsp + mov %eax,p_temp(%rsp) +.Lylrga: + test $2,%edx + jz .Lylrgb + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 4(%rcx),%eax + mov 4(%rbx),%ebx + mov p_inty+4(%rsp),%ecx + sub $8,%rsp + call .Lnp_special6 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+4(%rsp) +.Lylrgb: + test $4,%edx + jz .Lylrgc + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 8(%rcx),%eax + mov 8(%rbx),%ebx + mov p_inty+8(%rsp),%ecx + sub $8,%rsp + call .Lnp_special6 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+8(%rsp) +.Lylrgc: + test $8,%edx + jz .Lylrgd + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 12(%rcx),%eax + mov 12(%rbx),%ebx + mov p_inty+12(%rsp),%ecx + sub $8,%rsp + call .Lnp_special6 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+12(%rsp) +.Lylrgd: + movdqa p_temp(%rsp),%xmm0 + jmp .Lrnsx3 + +# a subroutine to treat an individual x,y pair when y is large or infinity +# assumes x in %eax, y in %ebx. +# returns result in eax +.Lnp_special6: +# handle |x|==1 cases first + mov $0x07FFFFFFF,%r8d + and %eax,%r8d + cmp $0x03f800000,%r8d # jump if |x| !=1 + jnz .Lnps6 + mov $0x03f800000,%eax # return 1 for all |x|==1 + jmp .Lnpx64 + +# cases where |x| !=1 +.Lnps6: + mov $0x07f800000,%ecx + xor %eax,%eax # assume 0 return + test $0x080000000,%ebx + jnz .Lnps62 # jump if y negative +# y = +inf + cmp $0x03f800000,%r8d + cmovg %ecx,%eax # return inf if |x| < 1 + jmp .Lnpx64 +.Lnps62: +# y = -inf + cmp $0x03f800000,%r8d + cmovl %ecx,%eax # return inf if |x| < 1 + jmp .Lnpx64 + +.Lnpx64: + ret + +# handle cases where x is +/- infinity. 
edx is the mask + .align 16 +.Lx_infinite: + movdqa %xmm0,p_temp(%rsp) + + test $1,%edx + jz .Lxinfa + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov (%rcx),%eax + mov (%rbx),%ebx + mov p_inty(%rsp),%ecx + sub $8,%rsp + call .Lnp_special_x1 # call the handler for one value + add $8,%rsp + mov %eax,p_temp(%rsp) +.Lxinfa: + test $2,%edx + jz .Lxinfb + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 4(%rcx),%eax + mov 4(%rbx),%ebx + mov p_inty+4(%rsp),%ecx + sub $8,%rsp + call .Lnp_special_x1 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+4(%rsp) +.Lxinfb: + test $4,%edx + jz .Lxinfc + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 8(%rcx),%eax + mov 8(%rbx),%ebx + mov p_inty+8(%rsp),%ecx + sub $8,%rsp + call .Lnp_special_x1 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+8(%rsp) +.Lxinfc: + test $8,%edx + jz .Lxinfd + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 12(%rcx),%eax + mov 12(%rbx),%ebx + mov p_inty+12(%rsp),%ecx + sub $8,%rsp + call .Lnp_special_x1 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+12(%rsp) +.Lxinfd: + movdqa p_temp(%rsp),%xmm0 + jmp .Lrnsx1 + +# a subroutine to treat an individual x,y pair when x is +/-infinity +# assumes x in %eax, y in %ebx, inty in %ecx. +# returns result in eax +.Lnp_special_x1: # x is infinite + test $0x080000000,%eax # is x positive + jnz .Lnsx11 # jump if not + test $0x080000000,%ebx # is y positive + jz .Lnsx13 # just return if so + xor %eax,%eax # else return 0 + jmp .Lnsx13 + +.Lnsx11: + cmp $1,%ecx ## if inty ==1 + jnz .Lnsx12 # jump if not + test $0x080000000,%ebx # is y positive + jz .Lnsx13 # just return if so + mov $0x080000000,%eax # else return -0 + jmp .Lnsx13 +.Lnsx12: # inty <>1 + and $0x07FFFFFFF,%eax # return -x (|x|) if y<0 + test $0x080000000,%ebx # is y positive + jz .Lnsx13 # + xor %eax,%eax # return 0 if y >=0 +.Lnsx13: + ret + + +# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0 + .align 16 +.Lx_zero: + movdqa %xmm0,p_temp(%rsp) + + test $1,%edx + jz .Lxzera + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov (%rcx),%eax + mov (%rbx),%ebx + mov p_inty(%rsp),%ecx + sub $8,%rsp + call .Lnp_special_x2 # call the handler for one value + add $8,%rsp + mov %eax,p_temp(%rsp) +.Lxzera: + test $2,%edx + jz .Lxzerb + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 4(%rcx),%eax + mov 4(%rbx),%ebx + mov p_inty+4(%rsp),%ecx + sub $8,%rsp + call .Lnp_special_x2 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+4(%rsp) +.Lxzerb: + test $4,%edx + jz .Lxzerc + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 8(%rcx),%eax + mov 8(%rbx),%ebx + mov p_inty+8(%rsp),%ecx + sub $8,%rsp + call .Lnp_special_x2 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+8(%rsp) +.Lxzerc: + test $8,%edx + jz .Lxzerd + lea p_ux(%rsp),%rcx # get pointer to x + lea p_uy(%rsp),%rbx # get pointer to y + mov 12(%rcx),%eax + mov 12(%rbx),%ebx + mov p_inty+12(%rsp),%ecx + sub $8,%rsp + call .Lnp_special_x2 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+12(%rsp) +.Lxzerd: + movdqa p_temp(%rsp),%xmm0 + jmp .Lrnsx2 + +# a subroutine to treat an individual x,y pair when x is +/-0 +# assumes x in %eax, y in %ebx, inty in %ecx. 
+# returns result in eax + .align 16 +.Lnp_special_x2: + cmp $1,%ecx ## if inty ==1 + jz .Lnsx21 # jump if so +# handle cases of x=+/-0, y not integer + xor %eax,%eax + mov $0x07f800000,%ecx + test $0x080000000,%ebx # is ypos + cmovnz %ecx,%eax + jmp .Lnsx23 +# y is an integer +.Lnsx21: + xor %r8d,%r8d + mov $0x07f800000,%ecx + test $0x080000000,%ebx # is ypos + cmovnz %ecx,%r8d # set to infinity if not + and $0x080000000,%eax # pick up the sign of x + or %r8d,%eax # and include it in the result +.Lnsx23: + ret + + + .data + .align 64 + +.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask + .quad 0x08000000080000000 + +.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask + .quad 0x07FFFFFFF7FFFFFFF + +# used by inty +.L__mask_127: .quad 0x00000007F0000007F # EXPBIAS_SP32 + .quad 0x00000007F0000007F + +.L__mask_mant: .quad 0x0007FFFFF007FFFFF # mantissa bit mask + .quad 0x0007FFFFF007FFFFF + +.L__mask_1: .quad 0x00000000100000001 # 1 + .quad 0x00000000100000001 + +.L__mask_2: .quad 0x00000000200000002 # 2 + .quad 0x00000000200000002 + +.L__mask_24: .quad 0x00000001800000018 # 24 + .quad 0x00000001800000018 + +.L__mask_23: .quad 0x00000001700000017 # 23 + .quad 0x00000001700000017 + +# used by special case checking + +.L__float_one: .quad 0x03f8000003f800000 # one + .quad 0x03f8000003f800000 + +.L__mask_inf: .quad 0x07f8000007F800000 # infinity + .quad 0x07f8000007F800000 + +.L__mask_NaN: .quad 0x07fC000007FC00000 # NaN + .quad 0x07fC000007FC00000 + +.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit + .quad 0x00040000000400000 + +.L__mask_ly: .quad 0x04f0000004f000000 # large y + .quad 0x04f0000004f000000
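The y-classification that __vrs4_powf performs with packed-integer operations can be written per lane in plain C. A sketch under the same conventions (0 = not an integer, 1 = odd integer, 2 = even integer); the function name is illustrative and the y == 0 case, which the routine handles separately, is folded in here as "even":

    #include <stdint.h>
    #include <string.h>

    int classify_y(float y)
    {
        uint32_t uy; memcpy(&uy, &y, sizeof uy);
        uint32_t ay = uy & 0x7FFFFFFFu;          /* |y| as bits */
        int yexp = (int)(ay >> 23) - 126;        /* unbiased exponent + 1 */
        if (ay == 0)   return 2;                 /* y == 0: treat as even */
        if (yexp < 1)  return 0;                 /* |y| < 1.0 */
        if (yexp > 24) return 2;                 /* no fractional bits left */
        uint32_t mask = (1u << (24 - yexp)) - 1; /* fractional-bit mask */
        if (uy & mask) return 0;                 /* fractional mantissa bits */
        return ((uy >> (24 - yexp)) & 1u) ? 1 : 2;  /* odd : even */
    }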
diff --git a/src/gas/vrs4powxf.S b/src/gas/vrs4powxf.S new file mode 100644 index 0000000..e18b5db --- /dev/null +++ b/src/gas/vrs4powxf.S
@@ -0,0 +1,538 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrs4powxf.asm +# +# A vector implementation of the powf libm function. +# This routine raises the x vector to a constant y power. +# +# Prototype: +# +# __m128 __vrs4_powxf(__m128 x,float y); +# +# Computes x raised to the y power. Returns proper C99 values. +# +# +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + + +# define local variable storage offsets +.equ p_temp,0x00 # xmmword +.equ p_negateres,0x10 # qword + +.equ save_rbx,0x020 #qword +.equ save_rsi,0x028 #qword + +.equ p_xptr,0x030 # ptr to x values +.equ p_y,0x038 # y value + +.equ p_inty,0x040 # integer y indicators + +.equ p_ux,0x050 # absolute x +.equ p_ax,0x060 # absolute x +.equ p_sx,0x070 # sign of x's + +.equ stack_size,0x088 # + + + + + + .text + .align 16 + .p2align 4,,15 +.globl __vrs4_powxf + .type __vrs4_powxf,@function +__vrs4_powxf: + + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx + + lea p_ux(%rsp),%rcx + mov %rcx,p_xptr(%rsp) # save pointer to x + movaps %xmm0,(%rcx) + movss %xmm1,p_y(%rsp) # save y + + movdqa %xmm1,%xmm4 + + movaps %xmm0,%xmm2 + andps .L__mask_nsign(%rip),%xmm0 # get abs x + andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits + movaps %xmm0,p_ax(%rsp) # save them + movaps %xmm2,p_sx(%rsp) # save them +# convert all four x's to double + cvtps2pd p_ax(%rsp),%xmm0 + cvtps2pd p_ax+8(%rsp),%xmm1 +# +# classify y +# vector 32 bit integer method 25 cycles to here +# /* See whether y is an integer. +# inty = 0 means not an integer. +# */ +# get yexp + mov p_y(%rsp),%r8d # r8 is uy + mov $0x07fffffff,%r9d + and %r8d,%r9d # r9 is ay + +## if |y| == 0 then return 1 + cmp $0,%r9d # is y a zero? + jz .Ly_zero + + mov $0x07f800000,%eax # EXPBITS_SP32 + and %r9d,%eax # y exp + + xor %edi,%edi + shr $23,%eax #>> EXPSHIFTBITS_SP32 + sub $126,%eax # - EXPBIAS_SP32 + 1 - eax is now the unbiased exponent + mov $1,%ebx + cmp %ebx,%eax ## if (yexp < 1) + cmovl %edi,%ebx + jl .Lsave_inty + + mov $24,%ecx + cmp %ecx,%eax ## if (yexp >24) + jle .Linfy1 + mov $2,%ebx + jmp .Lsave_inty +.Linfy1: # else 1<=yexp<=24 + sub %eax,%ecx # build mask for mantissa + shl %cl,%ebx + dec %ebx # rbx = mask = (1 << (24 - yexp)) - 1 + + mov %r8d,%eax + and %ebx,%eax ## if ((uy & mask) != 0) + cmovnz %edi,%ebx # inty = 0; + jnz .Lsave_inty + + not %ebx # else if (((uy & ~mask) >> (24 - yexp)) & 0x00000001) + mov %r8d,%eax + and %ebx,%eax + shr %cl,%eax + inc %edi + and %edi,%eax + mov %edi,%ebx # inty = 1 + jnz .Lsave_inty + inc %ebx # else inty = 2 + + +.Lsave_inty: + mov %r8d,p_y+4(%rsp) # r8d is ay + mov %ebx,p_inty(%rsp) # save inty +# +# do more x special case checking +# + pxor %xmm3,%xmm3 + xor %eax,%eax + mov $0x07FC00000,%ecx + cmp $0,%ebx # is y not an integer? 
+ cmovz %ecx,%eax # then set to return a NaN. else 0. + mov $0x080000000,%ecx + cmp $1,%ebx # is y an odd integer? + cmovz %ecx,%eax # maybe set sign bit if so + movd %eax,%xmm5 + pshufd $0,%xmm5,%xmm5 + + pcmpeqd p_sx(%rsp),%xmm3 ## if the signs are set + pandn %xmm5,%xmm3 # then negateres gets the values as shown below + movdqa %xmm3,p_negateres(%rsp) # save negateres + +# /* p_negateres now means the following. +# 7FC00000 means x<0, y not an integer, return NaN. +# 80000000 means x<0, y is odd integer, so set the sign bit. +## 0 means even integer, and/or x>=0. +# */ + +# **** Here starts the main calculations **** +# The algorithm used is x**y = exp(y*log(x)) +# Extra precision is required in intermediate steps to meet the 1ulp requirement +# +# log(x) calculation + call __vrd4_log@PLT # get the double precision log value + # for all four x's +# y* logx + cvtps2pd p_y(%rsp),%xmm2 #convert the two packed single y's to double + +# /* just multiply by y */ + mulpd %xmm2,%xmm0 + mulpd %xmm2,%xmm1 + +# /* The following code computes r = exp(w) */ + call __vrd4_exp@PLT # get the double exp value + # for all four y*log(x)'s +# +# convert all four results to double + cvtpd2ps %xmm0,%xmm0 + cvtpd2ps %xmm1,%xmm1 + movlhps %xmm1,%xmm0 + +# perform special case and error checking on input values + +# special case checking is done first in the scalar version since +# it allows for early fast returns. But for vectors, we consider them +# to be rare, so early returns are not necessary. So we first compute +# the x**y values, and then check for special cases. + +# we do some of the checking in reverse order of the scalar version. +# apply the negate result flags + orps p_negateres(%rsp),%xmm0 # get negateres + +## if y is infinite or so large that the result would overflow or underflow + mov p_y(%rsp),%edx # get y + and $0x07fffffff,%edx # develop ay + cmp $0x04f000000,%edx + ja .Ly_large +.Lrnsx3: + +## if x is infinite + movdqa p_ax(%rsp),%xmm4 + cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so. + movmskps %xmm4,%edx + test $0x0f,%edx + jnz .Lx_infinite +.Lrnsx1: +## if x is zero + xorps %xmm4,%xmm4 + cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so. + movmskps %xmm4,%edx + test $0x0f,%edx + jnz .Lx_zero +.Lrnsx2: +## if y is NAN + movss p_y(%rsp),%xmm4 # get y + ucomiss %xmm4,%xmm4 # comparing y to itself should + # be true, unless y is a NaN. parity flag if NaN. + jp .Ly_NaN +.Lrnsx4: +## if x is NAN + movdqa p_ax(%rsp),%xmm4 # get x + cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should + # be false, unless x is a NaN. ff's if NaN. + movmskps %xmm4,%ecx + test $0x0f,%ecx + jnz .Lx_NaN +.Lrnsx5: + +## if x == +1, return +1 for all x + movdqa .L__float_one(%rip),%xmm3 # one + mov p_xptr(%rsp),%rdx # get pointer to x + movdqa %xmm3,%xmm2 + cmpps $4,(%rdx),%xmm2 # not equal to +1.0?, ffs if not equal. + andps %xmm2,%xmm0 # keep the others + andnps %xmm3,%xmm2 # mask for ones + orps %xmm2,%xmm0 + +.L__powf_cleanup2: + + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + .align 16 +.Ly_zero: +## if |y| == 0 then return 1 + movdqa .L__float_one(%rip),%xmm0 # one + jmp .L__powf_cleanup2 +# * y is a NaN. +.Ly_NaN: + mov p_y(%rsp),%r8d + or $0x000400000,%r8d # convert to QNaNs + movd %r8d,%xmm0 # propagate to all results + shufps $0,%xmm0,%xmm0 + jmp .Lrnsx4 + +# y is a NaN. 
+.Lx_NaN: + mov p_xptr(%rsp),%rcx # get pointer to x + movdqa (%rcx),%xmm4 # get x + movdqa %xmm4,%xmm3 + movdqa %xmm4,%xmm5 + movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits + cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should + # be true, unless x is a NaN. 0's if NaN. + cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN. + andps %xmm4,%xmm0 # keep the other results + andps %xmm3,%xmm2 # get just the right signalling bits + andps %xmm5,%xmm3 # mask for the NaNs + orps %xmm2,%xmm3 # convert to QNaNs + orps %xmm3,%xmm0 # combine + jmp .Lrnsx5 + +# * y is infinite or so large that the result would +# overflow or underflow. +.Ly_large: + movdqa %xmm0,p_temp(%rsp) + + mov p_xptr(%rsp),%rcx # get pointer to x + mov (%rcx),%eax + mov p_y(%rsp),%ebx + mov p_inty(%rsp),%ecx + sub $8,%rsp + call .Lnp_special6 # call the handler for one value + add $8,%rsp + mov %eax,p_temp(%rsp) + + mov p_xptr(%rsp),%rcx # get pointer to x + mov 4(%rcx),%eax + mov p_y(%rsp),%ebx + mov p_inty(%rsp),%ecx + sub $8,%rsp + call .Lnp_special6 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+4(%rsp) + + mov p_xptr(%rsp),%rcx # get pointer to x + mov 8(%rcx),%eax + mov p_y(%rsp),%ebx + mov p_inty(%rsp),%ecx + sub $8,%rsp + call .Lnp_special6 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+8(%rsp) + + mov p_xptr(%rsp),%rcx # get pointer to x + mov 12(%rcx),%eax + mov p_y(%rsp),%ebx + mov p_inty(%rsp),%ecx + sub $8,%rsp + call .Lnp_special6 # call the handler for one value + add $8,%rsp + mov %eax,p_temp+12(%rsp) + + movdqa p_temp(%rsp),%xmm0 + jmp .Lrnsx3 + +# a subroutine to treat an individual x,y pair when y is large or infinity +# assumes x in .Ly(%rip),%eax in ebx. +# returns result in eax +.Lnp_special6: +# handle |x|==1 cases first + mov $0x07FFFFFFF,%r8d + and %eax,%r8d + cmp $0x03f800000,%r8d # jump if |x| !=1 + jnz .Lnps6 + mov $0x03f800000,%eax # return 1 for all |x|==1 + jmp .Lnpx64 + +# cases where |x| !=1 +.Lnps6: + mov $0x07f800000,%ecx + xor %eax,%eax # assume 0 return + test $0x080000000,%ebx + jnz .Lnps62 # jump if y negative +# y = +inf + cmp $0x03f800000,%r8d + cmovg %ecx,%eax # return inf if |x| < 1 + jmp .Lnpx64 +.Lnps62: +# y = -inf + cmp $0x03f800000,%r8d + cmovl %ecx,%eax # return inf if |x| < 1 + jmp .Lnpx64 + +.Lnpx64: + ret + +# handle cases where x is +/- infinity. 
+# handle cases where x is +/- infinity. edx is the mask of x,y pairs with |x|=inf
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive?
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive?
+ jz .Lnsx13 # just return (+inf) if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx ## if inty ==1 (y is an odd integer)
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive?
+ jz .Lnsx13 # just return (-inf) if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty != 1
+ and $0x07FFFFFFF,%eax # make the result positive (|x| = +inf)
+ test $0x080000000,%ebx # is y positive?
+ jz .Lnsx13 # return +inf if so
+ xor %eax,%eax # return 0 if y < 0
+.Lnsx13:
+ ret
+
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov (%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
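+# ---------------------------------------------------------------------
+# The same idea for .Lnp_special_x1 above, as a C sketch over raw bits
+# (illustrative; np_special_x1 is a hypothetical name):
+#
+# unsigned np_special_x1(unsigned x, unsigned y, int inty)  /* x = +/-inf */
+# {
+#     if ((x & 0x80000000) == 0)                      /* x = +inf */
+#         return (y & 0x80000000) ? 0 : 0x7F800000;
+#     if (inty == 1)                                  /* x = -inf, y odd int */
+#         return (y & 0x80000000) ? 0x80000000 : 0xFF800000;
+#     /* x = -inf, y even integer or not an integer */
+#     return (y & 0x80000000) ? 0 : 0x7F800000;
+# }
+# ---------------------------------------------------------------------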
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx ## if inty ==1 (y is an odd integer)
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not an odd integer
+ xor %eax,%eax # result is +0...
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is y negative?
+ cmovnz %ecx,%eax # ...or +inf if so
+ jmp .Lnsx23
+# y is an odd integer
+.Lnsx21:
+ xor %r8d,%r8d # result magnitude is 0...
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is y negative?
+ cmovnz %ecx,%r8d # ...or infinity if so
+ and $0x080000000,%eax # pick up the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf: .quad 0x07f8000007F800000 # infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
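+# ---------------------------------------------------------------------
+# For completeness, the .Lnp_special_x2 handler above in the same C
+# sketch style (illustrative; np_special_x2 is a hypothetical name):
+#
+# unsigned np_special_x2(unsigned x, unsigned y, int inty)  /* x = +/-0 */
+# {
+#     unsigned mag = (y & 0x80000000) ? 0x7F800000 : 0; /* inf if y<0 */
+#     if (inty != 1)
+#         return mag;                /* y not an odd integer: no sign */
+#     return mag | (x & 0x80000000); /* y odd integer: keep x's sign */
+# }
+# ---------------------------------------------------------------------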
diff --git a/src/gas/vrs4sincosf.S b/src/gas/vrs4sincosf.S new file mode 100644 index 0000000..2c3a0cc --- /dev/null +++ b/src/gas/vrs4sincosf.S
@@ -0,0 +1,1813 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4sincosf.S
+#
+# A vector implementation of the sincos libm function.
+#
+# Prototype:
+#
+# __vrs4_sincosf(__m128 x, __m128 * ys, __m128 * yc);
+#
+# Computes Sine and Cosine of x for an array of input values.
+# Places the Sine results into the supplied ys array and the Cosine
+# results into the supplied yc array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single precision Sine and Cosine values at a time.
+# The four values are passed as packed singles in xmm0.
+# The four Sine results are returned as packed singles in the supplied ys array.
+# The four Cosine results are returned as packed singles in the supplied yc array.
+# The results are written to memory because neither C nor the ABI provides
+# a way to return two values from a single function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
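+#
+# A caller-side sketch of this interface, using SSE intrinsics. The
+# wrapper below is hypothetical and assumes a void return type, which
+# the prototype above leaves unstated:
+#
+# #include <xmmintrin.h>
+#
+# extern void __vrs4_sincosf(__m128 x, __m128 *ys, __m128 *yc);
+#
+# void sincos4(const float *in, float *s, float *c)
+# {
+#     __m128 vs, vc;
+#     __vrs4_sincosf(_mm_loadu_ps(in), &vs, &vc);
+#     _mm_storeu_ps(s, vs);
+#     _mm_storeu_ps(c, vc);
+# }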
+ +# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 64 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03FA5555555502F31 + .quad 0x0BF56C16BF55699D7 # -0.00138889 c2 + .quad 0x0BF56C16BF55699D7 + .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3 + .quad 0x03EFA015C50A93B49 + .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4 + .quad 0x0BE92524743CC46B8 + +.Lsinarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BFC555555545E87D + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x03F811110DF01232D + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x0BF2A013A88A37196 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x03EC6DBE4AD1572D5 + +.Lsincosarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x0BF56C16BF55699D7 + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x03EFA015C50A93B49 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x0BE92524743CC46B8 + +.Lcossinarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BF56C16BF55699D7 # c2 + .quad 0x03F811110DF01232D + .quad 0x03EFA015C50A93B49 # c3 + .quad 0x0BF2A013A88A37196 + .quad 0x0BE92524743CC46B8 # c4 + .quad 0x03EC6DBE4AD1572D5 + +.align 64 + .Levensin_oddcos_tbl: + + .quad .Lsinsin_sinsin_piby4 # 0 * ; Done + .quad .Lsinsin_sincos_piby4 # 1 + ; Done + .quad .Lsinsin_cossin_piby4 # 2 ; Done + .quad .Lsinsin_coscos_piby4 # 3 + ; Done + + .quad .Lsincos_sinsin_piby4 # 4 ; Done + .quad .Lsincos_sincos_piby4 # 5 * ; Done + .quad .Lsincos_cossin_piby4 # 6 ; Done + .quad 
.Lsincos_coscos_piby4 # 7 ; Done + + .quad .Lcossin_sinsin_piby4 # 8 ; Done + .quad .Lcossin_sincos_piby4 # 9 ; TBD + .quad .Lcossin_cossin_piby4 # 10 * ; Done + .quad .Lcossin_coscos_piby4 # 11 ; Done + + .quad .Lcoscos_sinsin_piby4 # 12 ; Done + .quad .Lcoscos_sincos_piby4 # 13 + ; Done + .quad .Lcoscos_cossin_piby4 # 14 ; Done + .quad .Lcoscos_coscos_piby4 # 15 * ; Done + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .text + .align 16 + .p2align 4,,15 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# define local variable storage offsets +.equ p_temp,0 # temporary for get/put bits operation +.equ p_temp1,0x10 # temporary for get/put bits operation + +.equ save_xmm6,0x20 # temporary for get/put bits operation +.equ save_xmm7,0x30 # temporary for get/put bits operation +.equ save_xmm8,0x40 # temporary for get/put bits operation +.equ save_xmm9,0x50 # temporary for get/put bits operation +.equ save_xmm0,0x60 # temporary for get/put bits operation +.equ save_xmm11,0x70 # temporary for get/put bits operation +.equ save_xmm12,0x80 # temporary for get/put bits operation +.equ save_xmm13,0x90 # temporary for get/put bits operation +.equ save_xmm14,0x0A0 # temporary for get/put bits operation +.equ save_xmm15,0x0B0 # temporary for get/put bits operation + +.equ r,0x0C0 # pointer to r for remainder_piby2 +.equ rr,0x0D0 # pointer to r for remainder_piby2 +.equ region,0x0E0 # pointer to r for remainder_piby2 + +.equ r1,0x0F0 # pointer to r for remainder_piby2 +.equ rr1,0x0100 # pointer to r for remainder_piby2 +.equ region1,0x0110 # pointer to r for remainder_piby2 + +.equ p_temp2,0x0120 # temporary for get/put bits operation +.equ p_temp3,0x0130 # temporary for get/put bits operation + +.equ p_temp4,0x0140 # temporary for get/put bits operation +.equ p_temp5,0x0150 # temporary for get/put bits operation + +.equ p_original,0x0160 # original x +.equ p_mask,0x0170 # original x +.equ p_sign_sin,0x0180 # original x + +.equ p_original1,0x0190 # original x +.equ p_mask1,0x01A0 # original x +.equ p_sign1_sin,0x01B0 # original x + + +.equ save_r12,0x01C0 # temporary for get/put bits operation +.equ save_r13,0x01D0 # temporary for get/put bits operation + +.equ p_sin,0x01E0 # sin +.equ p_cos,0x01F0 # cos + +.equ save_rdi,0x0200 # temporary for get/put bits operation +.equ save_rsi,0x0210 # temporary for get/put bits operation + +.equ p_sign_cos,0x0220 # Sign of lower cos term +.equ p_sign1_cos,0x0230 # Sign of upper cos term + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +.globl __vrs4_sincosf + .type __vrs4_sincosf,@function +__vrs4_sincosf: + + sub $0x0248,%rsp + + mov %r12,save_r12(%rsp) # save r12 + + mov %r13,save_r13(%rsp) # save r13 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + + movhlps %xmm0,%xmm8 + cvtps2pd %xmm0,%xmm10 # convert input to double. + cvtps2pd %xmm8,%xmm1 # convert input to double. 
+ +movdqa %xmm10,%xmm6 +movdqa %xmm1,%xmm7 +movapd .L__real_7fffffffffffffff(%rip),%xmm2 + +andpd %xmm2,%xmm10 #Unsign +andpd %xmm2,%xmm1 #Unsign + +mov %rdi, p_sin(%rsp) # save address for sin return +mov %rsi, p_cos(%rsp) # save address for cos return + +movd %xmm10,%rax #rax is lower arg +movhpd %xmm10, p_temp+8(%rsp) # +mov p_temp+8(%rsp),%rcx #rcx = upper arg + +movd %xmm1,%r8 #r8 is lower arg +movhpd %xmm1, p_temp1+8(%rsp) # +mov p_temp1+8(%rsp),%r9 #r9 = upper arg + +movdqa %xmm10,%xmm12 +movdqa %xmm1,%xmm13 + +pcmpgtd %xmm6,%xmm12 +pcmpgtd %xmm7,%xmm13 +movdqa %xmm12,%xmm6 +movdqa %xmm13,%xmm7 +psrldq $4,%xmm12 +psrldq $4,%xmm13 +psrldq $8,%xmm6 +psrldq $8,%xmm7 + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + + +por %xmm6,%xmm12 +por %xmm7,%xmm13 + +movd %xmm12,%r12 #Move Sign to gpr ** +movd %xmm13,%r13 #Move Sign to gpr ** + +movapd %xmm10,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm10,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm10 + mulpd %xmm10,%xmm2 # * twobypi + mulpd %xmm10,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. 
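+# ---------------------------------------------------------------------
+# The block below subtracts the multiple of pi/2 in extra precision
+# (a Cody-Waite style reduction). In scalar C it reads roughly as
+# follows; the constants are the decimal values of the .data entries
+# above, and reduce() is a hypothetical name:
+#
+# static const double twobypi     = 6.36619772367581382433e-01;
+# static const double piby2_1     = 1.57079632673412561417e+00;
+# static const double piby2_2     = 6.07710050630396597660e-11;
+# static const double piby2_2tail = 2.02226624879595063154e-21;
+#
+# double reduce(double x, int *region)    /* x >= 0 */
+# {
+#     int npi2 = (int)(x * twobypi + 0.5);
+#     double rhead = x - npi2 * piby2_1;  /* exact: piby2_1 has a short mantissa */
+#     double rtail = npi2 * piby2_2;
+#     double t = rhead;
+#     rhead = t - rtail;
+#     rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#     *region = npi2 & 3;
+#     return rhead - rtail;               /* r, roughly in [-pi/4, pi/4] */
+# }
+# ---------------------------------------------------------------------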
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + +#DELETE +# mov .LQWORD,%rdx PTR __reald_one_zero ;compare value for cossin path +#DELETE + + mov %r8,%r10 + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + +# NEW + + #ADDED + mov %r10,%rdi # npi2 in int + mov %r11,%rsi # npi2 in int + #ADDED + + shr $1,%r10 # 0 and 1 => 0 + shr $1,%r11 # 2 and 3 => 1 + + mov %r10,%rax + mov %r11,%rcx + + #ADDED + xor %r10,%rdi # xor last 2 bits of region for cos + xor %r11,%rsi # xor last 2 bits of region for cos + #ADDED + + not %r12 #~(sign) + not %r13 #~(sign) + and %r12,%r10 #region & ~(sign) + and %r13,%r11 #region & ~(sign) + + not %rax #~(region) + not %rcx #~(region) + not %r12 #~~(sign) + not %r13 #~~(sign) + and %r12,%rax #~region & ~~(sign) + and %r13,%rcx #~region & ~~(sign) + + #ADDED + and .L__reald_one_one(%rip),%rdi # sign for cos + and .L__reald_one_one(%rip),%rsi # sign for cos + #ADDED + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 # sign for sin + and .L__reald_one_one(%rip),%r11 # sign for sin + + + + + + + + mov %r10,%r12 + mov %r11,%r13 + + #ADDED + mov %rdi,%rax + mov %rsi,%rcx + #ADDED + + and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit + + #ADDED + and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit + #ADDED + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + #ADDED + shl $63,%rdi #shift lower sign bit left by 63 bits + shl $63,%rsi #shift lower sign bit left by 63 bits + shl $31,%rax #shift upper sign bit left by 31 bits + shl $31,%rcx #shift upper sign bit left by 31 bits + #ADDED + + mov %r10,p_sign_sin(%rsp) #write out lower sign bit + mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit + mov %r11,p_sign1_sin(%rsp) #write out lower 
sign bit + mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit + + mov %rdi,p_sign_cos(%rsp) #write out lower sign bit + mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit + mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit + mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit + +# NEW + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm10,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm10 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + +# subpd %xmm10,%xmm6 ;rr=rhead-r +# subpd %xmm1,%xmm7 ;rr=rhead-r + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 # move r for r2 + movapd %xmm1,%xmm3 # move r for r2 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + +# subpd xmm6, xmm8 ;rr=(rhead-r) -rtail +# subpd xmm7, xmm9 ;rr=(rhead-r) -rtail + + and .L__reald_zero_one(%rip),%rax # region for jump table + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + +# HARSHA ADDED +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign_sin = Sign, p_sign_cos = Sign, xmm10 = r, xmm2 = r2 +# p_sign1_sin = Sign, p_sign1_cos = Sign, xmm1 = r, xmm3 = r2 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm14 # for x3 + movapd %xmm3,%xmm15 # for x3 + + movapd %xmm2,%xmm0 # for r + movapd %xmm3,%xmm11 # for r + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + movdqa .Lsinarray+0x30(%rip),%xmm6 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm7 # c4 + + movapd .Lsinarray+0x10(%rip),%xmm12 # c2 + movapd .Lsinarray+0x10(%rip),%xmm13 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm10,%xmm14 # x3 + mulpd %xmm1,%xmm15 # x3 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm6 # c2*x2 + mulpd %xmm3,%xmm7 # c2*x2 + + mulpd %xmm2,%xmm12 # c4*x2 + mulpd %xmm3,%xmm13 # c4*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4 + + addpd .Lsinarray(%rip),%xmm12 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm13 # c1+x2c2 + + mulpd %xmm2,%xmm4 # x4(c3+x2c4) + mulpd %xmm3,%xmm5 # x4(c3+x2c4) + + mulpd %xmm2,%xmm6 # x4(c3+x2c4) + mulpd %xmm3,%xmm7 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + addpd %xmm12,%xmm6 # zs + addpd %xmm13,%xmm7 # zs + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + mulpd %xmm14,%xmm6 # x3 * zs + mulpd %xmm15,%xmm7 # x3 * zs + + subpd %xmm0,%xmm4 # - (-t) + subpd %xmm11,%xmm5 # - (-t) + + addpd %xmm10,%xmm6 # +x + addpd %xmm1,%xmm7 # +x + +# HARSHA ADDED + + lea 
.Levensin_oddcos_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm0, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm10 + subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail) + subsd %xmm10,%xmm6 # rr=rhead-r + subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm10,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_lower_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd 
p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_sincosf_lower_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# added ins- changed input from xmm10 to xmm0 + movd %xmm10,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Restore xmm4 and %xmm3,,%xmm1 xmm7 +# Can use %xmm0,,%xmm8 xmm12 +# %xmm9,,%xmm5 xmm11, xmm13 + + movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + 
+#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store upper region + +# movsd %xmm6,%xmm10 +# subsd xmm10,xmm0 ; xmm10 = r=(rhead-rtail) +# subsd %xmm10,%xmm6 ; rr=rhead-r +# subsd xmm6, xmm0 ; xmm6 = rr=((rhead-r) -rtail) + + subsd %xmm0,%xmm6 # xmm10 = r=(rhead-rtail) + +# movlpd QWORD PTR r[rsp], xmm10 ; store upper r +# movlpd QWORD PTR rr[rsp], xmm6 ; store upper rr + + movlpd %xmm6,r(%rsp) # store upper r + + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_upper_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_sincosf_upper_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcheck_next2_args: + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r8 + jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfirst_second_done_fourth_arg_gt_5e5 + + + +# Work on next two args, both < 5e5 +# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5 + + movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. 
+ +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm5,region1(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm1,%xmm7 ; rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + +# subpd %xmm1,%xmm7 ; rr=rhead-r +# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr1[rsp], xmm7 + + jmp .L__vrs4_sincosf_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm0, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm10,%xmm6 ; rhead + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# subpd %xmm10,%xmm6 ; rr=rhead-r +# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr[rsp], xmm6 + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd 
%xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + + + subsd %xmm10,%xmm7 # xmm1 = r=(rhead-rtail) + + movlpd %xmm7,r1+8(%rsp) # store upper r + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sincosf_lower_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_sincosf_reconstruct + + + + + + + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + + movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm1,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp1(%rsp),%r9 #Restore upper arg + + + jmp 0f + +.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call + + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + + jmp .L__vrs4_sincosf_reconstruct + 
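+# ---------------------------------------------------------------------
+# All of the naninf fallbacks above share one idiom: the raw input is
+# ORed with the quiet bit of a double NaN and stored as r with region 0,
+# which turns both NaN and infinity inputs into a quiet-NaN result.
+# A per-lane C sketch (handle_naninf is a hypothetical name):
+#
+# void handle_naninf(unsigned long long x, unsigned long long *r, int *region)
+# {
+#     *r = x | 0x0008000000000000ULL;  /* quiet bit of a double NaN */
+#     *region = 0;
+# }
+# ---------------------------------------------------------------------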
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm10,%xmm6 ; rhead + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# subpd %xmm10,%xmm6 ; rr=rhead-r +# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr[rsp], xmm6 + + +# Work on next two args, third arg < 5e5, fourth arg >= 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_fourth_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r8d,region1(%rsp) # store lower region + +# movsd %xmm7,%xmm1 +# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail) +# subsd %xmm1,%xmm7 ; rr=rhead-r +# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail) + + subsd %xmm10,%xmm7 # xmm10 = r=(rhead-rtail) + +# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r +# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr + + movlpd %xmm7,r1(%rsp) # store upper r + +#Work 
on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_upper_naninf_higher + + lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sincosf_upper_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_sincosf_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_sincosf_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign_sin = Sign, ; p_sign_cos = Sign, xmm10 = r, xmm2 = r2 +# p_sign1_sin = Sign, ; p_sign1_cos = Sign, xmm1 = r, xmm3 = r2 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd r(%rsp),%xmm10 + movapd r1(%rsp),%xmm1 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + + mov %r8,%r10 + mov %r9,%r11 + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + +# NEW + + #ADDED + mov %r10,%rdi + mov %r11,%rsi + #ADDED + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + #ADDED + xor %r10,%rdi + xor %r11,%rsi + #ADDED + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + #ADDED + and .L__reald_one_one(%rip),%rdi #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%rsi #(~AB+A~B)&1 + #ADDED + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + + + + + + + mov %r10,%r12 + mov %r11,%r13 + + #ADDED + mov %rdi,%rax + mov %rsi,%rcx + #ADDED + + and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit + + #ADDED + and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit + #ADDED + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + #ADDED + shl $63,%rdi #shift lower sign bit left by 63 bits + shl $63,%rsi #shift lower sign bit left by 63 bits + shl $31,%rax #shift upper sign bit left by 31 bits + shl $31,%rcx #shift upper sign bit left by 31 bits + #ADDED + + mov %r10,p_sign_sin(%rsp) #write out lower sign bit + mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit + mov %r11,p_sign1_sin(%rsp) #write out lower sign bit + mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit + + mov %rdi,p_sign_cos(%rsp) #write out lower sign bit + mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit + mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit + mov 
%rcx,p_sign1_cos+8(%rsp) #write out upper sign bit +#NEW + + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + +# HARSHA ADDED +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign_cos = Sign, p_sign_sin = Sign, xmm10 = r, xmm2 = r2 +# p_sign1_cos = Sign, p_sign1_sin = Sign, xmm1 = r, xmm3 = r2 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm14 # for x3 + movapd %xmm3,%xmm15 # for x3 + + movapd %xmm2,%xmm0 # for r + movapd %xmm3,%xmm11 # for r + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + movdqa .Lsinarray+0x30(%rip),%xmm6 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm7 # c4 + + movapd .Lsinarray+0x10(%rip),%xmm12 # c2 + movapd .Lsinarray+0x10(%rip),%xmm13 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm10,%xmm14 # x3 + mulpd %xmm1,%xmm15 # x3 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm6 # c2*x2 + mulpd %xmm3,%xmm7 # c2*x2 + + mulpd %xmm2,%xmm12 # c4*x2 + mulpd %xmm3,%xmm13 # c4*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4 + + addpd .Lsinarray(%rip),%xmm12 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm13 # c1+x2c2 + + mulpd %xmm2,%xmm4 # x4(c3+x2c4) + mulpd %xmm3,%xmm5 # x4(c3+x2c4) + + mulpd %xmm2,%xmm6 # x4(c3+x2c4) + mulpd %xmm3,%xmm7 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + addpd %xmm12,%xmm6 # zs + addpd %xmm13,%xmm7 # zs + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + mulpd %xmm14,%xmm6 # x3 * zs + mulpd %xmm15,%xmm7 # x3 * zs + + subpd %xmm0,%xmm4 # - (-t) + subpd %xmm11,%xmm5 # - (-t) + + addpd %xmm10,%xmm6 # +x + addpd %xmm1,%xmm7 # +x + +# HARSHA ADDED + + + lea .Levensin_oddcos_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_sincosf_cleanup: + + mov p_sin(%rsp),%rdi + mov p_cos(%rsp),%rsi + + movapd p_sign_cos(%rsp),%xmm10 + movapd p_sign1_cos(%rsp),%xmm1 + + + xorpd %xmm4,%xmm10 # Cos term (+) Sign + xorpd %xmm5,%xmm1 # Cos term (+) Sign + + cvtpd2ps %xmm10,%xmm0 + cvtpd2ps %xmm1,%xmm11 + + movapd p_sign_sin(%rsp),%xmm14 + movapd p_sign1_sin(%rsp),%xmm15 + + xorpd %xmm6,%xmm14 # Sin term (+) Sign + xorpd %xmm7,%xmm15 # Sin term (+) Sign + + cvtpd2ps %xmm14,%xmm12 + cvtpd2ps %xmm15,%xmm13 + + movlps %xmm0,(%rsi) # save the cos + movlps %xmm12,(%rdi) # save the sin + movlps %xmm11,8(%rsi) # save the cos + movlps %xmm13,8(%rdi) # save the sin + + mov save_r12(%rsp),%r12 # restore r12 + mov save_r13(%rsp),%r13 # restore r13 + + add $0x0248,%rsp + 
ret + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +.align 16 +.Lcoscos_coscos_piby4: +# Cos in %xmm5,%xmm4 +# Sin in %xmm7,%xmm6 +# Lower and Upper Even + + movapd %xmm4,%xmm8 + movapd %xmm5,%xmm9 + + movapd %xmm6,%xmm4 + movapd %xmm7,%xmm5 + + movapd %xmm8,%xmm6 + movapd %xmm9,%xmm7 + + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lcossin_cossin_piby4: + + movhlps %xmm5,%xmm9 + movhlps %xmm7,%xmm13 + + movlhps %xmm9,%xmm7 + movlhps %xmm13,%xmm5 + + movhlps %xmm4,%xmm8 + movhlps %xmm6,%xmm12 + + movlhps %xmm8,%xmm6 + movlhps %xmm12,%xmm4 + + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lsincos_cossin_piby4: + movsd %xmm5,%xmm9 + movsd %xmm7,%xmm13 + + movsd %xmm9,%xmm7 + movsd %xmm13,%xmm5 + + movhlps %xmm4,%xmm8 + movhlps %xmm6,%xmm12 + + movlhps %xmm8,%xmm6 + movlhps %xmm12,%xmm4 + + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lsincos_sincos_piby4: + movsd %xmm5,%xmm9 + movsd %xmm7,%xmm13 + + movsd %xmm9,%xmm7 + movsd %xmm13,%xmm5 + + movsd %xmm4,%xmm8 + movsd %xmm6,%xmm12 + + movsd %xmm8,%xmm6 + movsd %xmm12,%xmm4 + + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lcossin_sincos_piby4: + movhlps %xmm5,%xmm9 + movhlps %xmm7,%xmm13 + + movlhps %xmm9,%xmm7 + movlhps %xmm13,%xmm5 + + movsd %xmm4,%xmm8 + movsd %xmm6,%xmm12 + + movsd %xmm8,%xmm6 + movsd %xmm12,%xmm4 + + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lcoscos_sinsin_piby4: +# Cos in %xmm5,%xmm4 +# Sin in %xmm7,%xmm6 +# Lower even, Upper odd, Swap upper + + movapd %xmm5,%xmm9 + movapd %xmm7,%xmm5 + movapd %xmm9,%xmm7 + + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lsinsin_coscos_piby4: +# Cos in %xmm5,%xmm4 +# Sin in %xmm7,%xmm6 +# Lower odd, Upper even, Swap lower + + movapd %xmm4,%xmm8 + movapd %xmm6,%xmm4 + movapd %xmm8,%xmm6 + + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lcoscos_cossin_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + + movapd %xmm5,%xmm9 + movapd %xmm7,%xmm5 + movapd %xmm9,%xmm7 + + movhlps %xmm4,%xmm8 + movhlps %xmm6,%xmm12 + + movlhps %xmm8,%xmm6 + movlhps %xmm12,%xmm4 + + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lcoscos_sincos_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + + movapd %xmm5,%xmm9 + movapd %xmm7,%xmm5 + movapd %xmm9,%xmm7 + + movsd %xmm4,%xmm8 + movsd %xmm6,%xmm12 + + movsd %xmm8,%xmm6 + movsd %xmm12,%xmm4 + jmp .L__vrs4_sincosf_cleanup + +.align 16 +.Lcossin_coscos_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + + movapd %xmm4,%xmm8 + movapd %xmm6,%xmm4 + movapd %xmm8,%xmm6 + + movhlps %xmm5,%xmm9 + movhlps %xmm7,%xmm13 + + movlhps %xmm9,%xmm7 + movlhps %xmm13,%xmm5 + + jmp 
.L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrs4_sincosf_cleanup
+
+
+.align 16
+.Lsincos_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm5
+ movsd %xmm9,%xmm7
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm4
+ movsd %xmm8,%xmm6
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+# Lower and Upper even; results are already in the right registers, no swap needed
+
+ jmp .L__vrs4_sincosf_cleanup
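+
+# ---------------------------------------------------------------------
+# The piby4 kernels dispatched above evaluate these minimax polynomials;
+# a scalar C sketch (coefficient values are truncated here; the full-
+# precision values are the hex constants in .Lsinarray/.Lcosarray):
+#
+# static const double s1 = -0.166667,    s2 = 0.00833333,
+#                     s3 = -0.000198413, s4 = 2.75573e-06;
+# static const double c1 = 0.0416667,    c2 = -0.00138889,
+#                     c3 = 2.48016e-05,  c4 = -2.75573e-07;
+#
+# double sin_piby4(double r)              /* |r| <= pi/4 */
+# {
+#     double r2 = r * r, r4 = r2 * r2;
+#     double zs = (s1 + r2 * s2) + r4 * (s3 + r2 * s4);
+#     return r + (r2 * r) * zs;
+# }
+#
+# double cos_piby4(double r)              /* |r| <= pi/4 */
+# {
+#     double r2 = r * r, r4 = r2 * r2;
+#     double zc = (c1 + r2 * c2) + r4 * (c3 + r2 * c4);
+#     return 1.0 - 0.5 * r2 + r4 * zc;
+# }
+# ---------------------------------------------------------------------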
diff --git a/src/gas/vrs4sinf.S b/src/gas/vrs4sinf.S new file mode 100644 index 0000000..3744f33 --- /dev/null +++ b/src/gas/vrs4sinf.S
@@ -0,0 +1,2171 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4sinf.S
+#
+# A vector implementation of the sin libm function.
+#
+# Prototype:
+#
+# __m128 __vrs4_sinf(__m128 x);
+#
+# Computes Sine of x for four single precision values at a time.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# The four values are passed as packed singles in xmm0, and the four
+# results are returned as packed singles in xmm0, as the prototype implies.
+# It is expected that some compilers may be able to take advantage of this
+# register interface when implementing vectorized loops. Using an array
+# implementation of the routine would require putting the inputs into
+# memory and retrieving the results from memory. This routine eliminates
+# the need for this overhead if the data does not already reside in memory.
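+#
+# A caller-side sketch of this interface (illustrative; sin4 is a
+# hypothetical wrapper):
+#
+# #include <xmmintrin.h>
+#
+# extern __m128 __vrs4_sinf(__m128 x);
+#
+# void sin4(const float *in, float *out)
+# {
+#     _mm_storeu_ps(out, __vrs4_sinf(_mm_loadu_ps(in)));
+# }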
+# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 64 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03FA5555555502F31 + .quad 0x0BF56C16BF55699D7 # -0.00138889 c2 + .quad 0x0BF56C16BF55699D7 + .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3 + .quad 0x03EFA015C50A93B49 + .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4 + .quad 0x0BE92524743CC46B8 + +.Lsinarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BFC555555545E87D + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x03F811110DF01232D + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x0BF2A013A88A37196 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x03EC6DBE4AD1572D5 + +.Lsincosarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x0BF56C16BF55699D7 + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x03EFA015C50A93B49 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x0BE92524743CC46B8 + +.Lcossinarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BF56C16BF55699D7 # c2 + .quad 0x03F811110DF01232D + .quad 0x03EFA015C50A93B49 # c3 + .quad 0x0BF2A013A88A37196 + .quad 0x0BE92524743CC46B8 # c4 + .quad 0x03EC6DBE4AD1572D5 + +.align 64 + .Levensin_oddcos_tbl: + + .quad .Lsinsin_sinsin_piby4 # 0 * ; Done + .quad .Lsinsin_sincos_piby4 # 1 + ; Done + .quad .Lsinsin_cossin_piby4 # 2 ; Done + .quad .Lsinsin_coscos_piby4 # 3 + ; Done + + .quad .Lsincos_sinsin_piby4 # 4 ; Done + .quad .Lsincos_sincos_piby4 # 5 * ; Done + .quad .Lsincos_cossin_piby4 # 6 ; Done + .quad 
.Lsincos_coscos_piby4 # 7 ; Done + + .quad .Lcossin_sinsin_piby4 # 8 ; Done + .quad .Lcossin_sincos_piby4 # 9 ; TBD + .quad .Lcossin_cossin_piby4 # 10 * ; Done + .quad .Lcossin_coscos_piby4 # 11 ; Done + + .quad .Lcoscos_sinsin_piby4 # 12 ; Done + .quad .Lcoscos_sincos_piby4 # 13 + ; Done + .quad .Lcoscos_cossin_piby4 # 14 ; Done + .quad .Lcoscos_coscos_piby4 # 15 * ; Done + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .text + .align 16 + .p2align 4,,15 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# define local variable storage offsets +.equ p_temp,0 # temporary for get/put bits operation +.equ p_temp1,0x10 # temporary for get/put bits operation + +.equ save_xmm6,0x20 # temporary for get/put bits operation +.equ save_xmm7,0x30 # temporary for get/put bits operation +.equ save_xmm8,0x40 # temporary for get/put bits operation +.equ save_xmm9,0x50 # temporary for get/put bits operation +.equ save_xmm0,0x60 # temporary for get/put bits operation +.equ save_xmm11,0x70 # temporary for get/put bits operation +.equ save_xmm12,0x80 # temporary for get/put bits operation +.equ save_xmm13,0x90 # temporary for get/put bits operation +.equ save_xmm14,0x0A0 # temporary for get/put bits operation +.equ save_xmm15,0x0B0 # temporary for get/put bits operation + +.equ r,0x0C0 # pointer to r for remainder_piby2 +.equ rr,0x0D0 # pointer to r for remainder_piby2 +.equ region,0x0E0 # pointer to r for remainder_piby2 + +.equ r1,0x0F0 # pointer to r for remainder_piby2 +.equ rr1,0x0100 # pointer to r for remainder_piby2 +.equ region1,0x0110 # pointer to r for remainder_piby2 + +.equ p_temp2,0x0120 # temporary for get/put bits operation +.equ p_temp3,0x0130 # temporary for get/put bits operation + +.equ p_temp4,0x0140 # temporary for get/put bits operation +.equ p_temp5,0x0150 # temporary for get/put bits operation + +.equ p_original,0x0160 # original x +.equ p_mask,0x0170 # original x +.equ p_sign,0x0180 # original x + +.equ p_original1,0x0190 # original x +.equ p_mask1,0x01A0 # original x +.equ p_sign1,0x01B0 # original x + +.equ save_r12,0x01C0 # temporary for get/put bits operation +.equ save_r13,0x01D0 # temporary for get/put bits operation + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +.globl __vrs4_sinf + .type __vrs4_sinf,@function +__vrs4_sinf: + + sub $0x01E8,%rsp + + mov %r12,save_r12(%rsp) # save r12 + + mov %r13,save_r13(%rsp) # save r13 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#STARTMAIN + + movhlps %xmm0,%xmm8 + cvtps2pd %xmm0,%xmm10 # convert input to double. + cvtps2pd %xmm8,%xmm1 # convert input to double. 
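+
+# The four single-precision inputs now travel as two packed-double pairs
+# processed side by side: inputs 0-1 in xmm10, inputs 2-3 in xmm1.
+# Loosely, in C terms (a sketch, not the actual code):
+#     double lo[2] = { (double)x[0], (double)x[1] };   /* xmm10 */
+#     double hi[2] = { (double)x[2], (double)x[3] };   /* xmm1  */
+# Working in double is presumably what lets the rr (tail) computations
+# later in the file stay commented out for a single-precision result.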
+ +movdqa %xmm10,%xmm6 +movdqa %xmm1,%xmm7 +movapd .L__real_7fffffffffffffff(%rip),%xmm2 + +andpd %xmm2,%xmm10 #Unsign +andpd %xmm2,%xmm1 #Unsign + +movd %xmm10,%rax #rax is lower arg +movhpd %xmm10, p_temp+8(%rsp) # +mov p_temp+8(%rsp),%rcx #rcx = upper arg + +movd %xmm1,%r8 #r8 is lower arg +movhpd %xmm1, p_temp1+8(%rsp) # +mov p_temp1+8(%rsp),%r9 #r9 = upper arg + +movdqa %xmm10,%xmm12 +movdqa %xmm1,%xmm13 + +pcmpgtd %xmm6,%xmm12 +pcmpgtd %xmm7,%xmm13 +movdqa %xmm12,%xmm6 +movdqa %xmm13,%xmm7 +psrldq $4,%xmm12 +psrldq $4,%xmm13 +psrldq $8,%xmm6 +psrldq $8,%xmm7 + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + + +por %xmm6,%xmm12 +por %xmm7,%xmm13 +movd %xmm12,%r12 #Move Sign to gpr ** +movd %xmm13,%r13 #Move Sign to gpr ** + +movapd %xmm10,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm10,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm10 + mulpd %xmm10,%xmm2 # * twobypi + mulpd %xmm10,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. 
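+
+# Worked example of the npi2 computation above (illustrative numbers):
+# for x = 10.0, x*twobypi ~= 6.366; adding 0.5 and truncating gives
+# npi2 = 6, so the lane lands in region 6 & 3 = 2, a -sin(r) region.
+# Because x has already been stripped of its sign at this point, the
+# +0.5/truncate pair behaves as round-to-nearest.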
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + mov %r8,%r10 + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm10,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm10 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 # move r for r2 + movapd %xmm1,%xmm3 # move r for r2 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + lea .Levensin_oddcos_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + + + + + 
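+
+# Recap of the reduction performed above, per lane (same names as in the
+# comments; a scalar sketch, not the actual code):
+#     rhead = x - npi2 * piby2_1;       /* piby2_1 has trailing zero bits */
+#     rtail = npi2 * piby2_2;
+#     t     = rhead;
+#     rhead = t - rtail;
+#     rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#     r     = rhead - rtail;            /* reduced argument, |r| <= pi/4  */
+# pi/2 is split into piby2_1 + piby2_2 + piby2_2tail so that each partial
+# product stays accurate enough for the single-precision result.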
+ + + + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm0, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm10 + subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail) + movlpd %xmm10,r+8(%rsp) # store upper r + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_lower_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_sinf_lower_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + 
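+
+# The NaN/Inf screen above tests the exponent field of the double directly.
+# In C terms the check and the quieting step are roughly (a sketch using
+# the IEEE-754 double bit layout):
+#     if ((bits & 0x7ff0000000000000ULL) == 0x7ff0000000000000ULL) { /* NaN/Inf */
+#         bits |= 0x0008000000000000ULL;   /* set the quiet-NaN bit */
+#         region = 0;
+#     }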
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# added ins- changed input from xmm10 to xmm0 + movd %xmm10,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrs4_sinf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf +# mov .LQWORD,%rax PTR p_original[rsp] + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm6,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrs4_sinf_upper_naninf_of_both_gt_5e5: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Restore xmm4 and %xmm3,,%xmm1 xmm7 +# Can use %xmm0,,%xmm8 xmm12 +# %xmm9,,%xmm5 xmm11, xmm13 + + movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # 
t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store upper region + + subsd %xmm0,%xmm6 # xmm10 = r=(rhead-rtail) + + movlpd %xmm6,r(%rsp) # store upper r + + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_upper_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r+8(%rsp),%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_sinf_upper_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcheck_next2_args: + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r8 + jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfirst_second_done_fourth_arg_gt_5e5 + + + +# Work on next two args, both < 5e5 +# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5 + + movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. 
+ +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm5,region1(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm1,%xmm7 ; rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + +# subpd %xmm1,%xmm7 ; rr=rhead-r +# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr1[rsp], xmm7 + + jmp .L__vrs4_sinf_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm0, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm10,%xmm6 ; rhead + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# subpd %xmm10,%xmm6 ; rr=rhead-r +# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr[rsp], xmm6 + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd 
%xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + + subsd %xmm10,%xmm7 # xmm1 = r=(rhead-rtail) + + movlpd %xmm7,r1+8(%rsp) # store upper r + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1(%rsp),%rdi + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sinf_lower_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_sinf_reconstruct + + + + + + + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + + movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm1,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp1(%rsp),%r9 #Restore upper arg + + jmp 0f + +.L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm7,%rdi + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + jmp .L__vrs4_sinf_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 
5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm10,%xmm6 ; rhead + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# subpd %xmm10,%xmm6 ; rr=rhead-r +# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr[rsp], xmm6 + + +# Work on next two args, third arg < 5e5, fourth arg >= 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_fourth_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r8d,region1(%rsp) # store lower region + +# movsd %xmm7,%xmm1 +# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail) +# subsd %xmm1,%xmm7 ; rr=rhead-r +# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail) + + subsd %xmm10,%xmm7 # xmm10 = r=(rhead-rtail) + +# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r +# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr + + movlpd %xmm7,r1(%rsp) # store upper r + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov 
$0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_upper_naninf_higher + + lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1+8(%rsp),%rdi + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sinf_upper_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_sinf_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_sinf_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd r(%rsp),%xmm10 + movapd r1(%rsp),%xmm1 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + + mov %r8,%r10 + mov %r9,%r11 + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + + lea .Levensin_oddcos_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_sinf_cleanup: + + movapd p_sign(%rsp),%xmm10 + movapd p_sign1(%rsp),%xmm1 + + xorpd %xmm4,%xmm10 # (+) Sign + xorpd %xmm5,%xmm1 # (+) Sign + + cvtpd2ps %xmm10,%xmm0 + cvtpd2ps %xmm1,%xmm11 + movlhps %xmm11,%xmm0 + + mov save_r12(%rsp),%r12 # restore r12 + mov save_r13(%rsp),%r13 # restore r13 + + add $0x01E8,%rsp + ret + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 
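+
+# The sixteen kernels below are reached through the indirect jump above:
+# the low bit of each lane's region picks sin vs cos, and the four lane
+# bits are packed into one 4-bit table index.  Per lane this is equivalent
+# to the following sketch (sin_piby4/cos_piby4 are hypothetical scalar
+# kernels):
+#     switch (region & 3) {
+#     case 0: res =  sin_piby4(r); break;
+#     case 1: res =  cos_piby4(r); break;
+#     case 2: res = -sin_piby4(r); break;
+#     case 3: res = -cos_piby4(r); break;
+#     }
+#     if (x < 0.0) res = -res;  /* folded into p_sign, XORed in the cleanup */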
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_coscos_piby4: + movapd %xmm2,%xmm0 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm2,%xmm4 # x4(c3+x2c4) + mulpd %xmm3,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm0,%xmm4 # + t + subpd %xmm11,%xmm5 # + t + + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lcossin_cossin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # s2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3 + addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3 + addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1 + addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm0,%xmm0 # move high x4 for cos term + movhlps %xmm11,%xmm11 # move high x4 for cos term + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd %xmm10,%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + + mulpd 
.L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos + movhlps %xmm3,%xmm13 # move high r for cos + + movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos + movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm7,%xmm5 # sin *x3 + + mulsd %xmm0,%xmm8 # cos *x4 + mulsd %xmm11,%xmm9 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0 + + addsd %xmm10,%xmm4 # sin + x + addsd %xmm1,%xmm5 # sin + x + subsd %xmm12,%xmm8 # cos+t + subsd %xmm13,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + + jmp .L__vrs4_sinf_cleanup +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lsincos_cossin_piby4: + + movapd .Lsincosarray+0x30(%rip),%xmm4 # s4 + movapd .Lcossinarray+0x30(%rip),%xmm5 # s4 + movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2 + movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm3,%xmm7 # sincos term upper x2 for x3 + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4 + addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4 + addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2 + addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm0,%xmm0 # move high x4 for cos term + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin) + mulpd %xmm1,%xmm7 + + mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin) + movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + + movhlps %xmm2,%xmm12 # move high r for cos (cossin) + + + movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin) + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos) + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm11,%xmm5 # cos *x4 + mulsd %xmm0,%xmm8 # cos *x4 + mulsd %xmm7,%xmm9 # sin *x3 + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos) + + addsd %xmm10,%xmm4 # sin + x + + addsd %xmm11,%xmm9 # sin + x + + + subsd %xmm12,%xmm8 # cos+t + subsd %xmm3,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lsincos_sincos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd .Lcossinarray+0x30(%rip),%xmm4 # s4 + movapd .Lcossinarray+0x30(%rip),%xmm5 # s4 + movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2 + movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2 + + movapd 
%xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm2,%xmm6 # move x2 for x4 + movapd %xmm3,%xmm7 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3 + addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3 + addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1 + addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1 + + mulpd %xmm0,%xmm4 # x4(s4+x2s3) + mulpd %xmm11,%xmm5 # x4(s4+x2s3) + + mulpd %xmm10,%xmm6 # get low x3 for sin term + mulpd %xmm1,%xmm7 # get low x3 for sin term + movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term + movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term + + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms + mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm4,%xmm12 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm13 # xmm9 = sin , xmm5 = cos + + mulsd %xmm6,%xmm12 # sin *x3 + mulsd %xmm7,%xmm13 # sin *x3 + mulsd %xmm0,%xmm4 # cos *x4 + mulsd %xmm11,%xmm5 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0 + + movhlps %xmm10,%xmm0 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x for sin term + # Reverse 10 and 0 + + addsd %xmm0,%xmm12 # sin + x + addsd %xmm11,%xmm13 # sin + x + + subsd %xmm2,%xmm4 # cos+t + subsd %xmm3,%xmm5 # cos+t + + movlhps %xmm12,%xmm4 + movlhps %xmm13,%xmm5 + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lcossin_sincos_piby4: + + movapd .Lcossinarray+0x30(%rip),%xmm4 # s4 + movapd .Lsincosarray+0x30(%rip),%xmm5 # s4 + movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2 + movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos) + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4 + addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4 + addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2 + addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm11,%xmm11 # move high x4 for cos term + + movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin) + mulpd %xmm10,%xmm7 + + mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin) + movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos) + + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + + movhlps %xmm3,%xmm12 # move high r for cos (cossin) + + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos) + movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin) + + mulsd %xmm0,%xmm4 # cos *x4 + mulsd %xmm6,%xmm5 # sin *x3 + mulsd %xmm7,%xmm8 # sin *x3 + mulsd %xmm11,%xmm9 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + + movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos) + + subsd %xmm2,%xmm4 # cos-(-t) + subsd %xmm12,%xmm9 # cos-(-t) + + addsd %xmm11,%xmm8 # sin + x + addsd %xmm1,%xmm5 # sin + x + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp 
.L__vrs4_sinf_cleanup +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: SIN +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: COS +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm0 # x2 ; SIN + movapd %xmm3,%xmm11 # x2 ; COS + movapd %xmm3,%xmm1 # copy of x2 for x4 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm0 # x4 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm3,%xmm1 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + + mulpd %xmm0,%xmm4 # x4(c3+x2c4) + mulpd %xmm1,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm1,%xmm5 # x4 * zc + + addpd %xmm10,%xmm4 # +x + subpd %xmm11,%xmm5 # +t + + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lsinsin_coscos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: COS +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: SIN +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm0 # x2 ; COS + movapd %xmm3,%xmm11 # x2 ; SIN + movapd %xmm2,%xmm10 # copy of x2 for x4 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # s4 + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # s2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # s4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # s2*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4 + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # s1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm10,%xmm4 # x4(c3+x2c4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zs + + mulpd %xmm10,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x3 * zc + + subpd %xmm0,%xmm4 # +t + addpd %xmm1,%xmm5 # +x + + jmp .L__vrs4_sinf_cleanup +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_cossin_piby4: #Derive from cossin_coscos + movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos + movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin + movapd %xmm3,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + 
mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcosarray(%rip),%xmm9 # c2+x2c1 + + + movapd %xmm12,%xmm2 # upper=x4 + movsd %xmm6,%xmm2 # lower=x2 + mulsd %xmm10,%xmm2 # lower=x2*x + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # upper= x4 * zc + # lower=x3 * zs + mulpd %xmm13,%xmm5 # x4 * zc + + + movlhps %xmm7,%xmm10 # + addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos + subpd %xmm11,%xmm5 # -(-t) + + jmp .L__vrs4_sinf_cleanup +.align 16 +.Lcoscos_sincos_piby4: #Derive from cossin_coscos + movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos + movapd %xmm3,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcosarray(%rip),%xmm9 # c2+x2c1 + + mulpd %xmm10,%xmm2 # upper=x3 for sin + mulsd %xmm10,%xmm2 # lower=x4 for cos + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # lower= x4 * zc + # upper= x3 * zs + mulpd %xmm13,%xmm5 # x4 * zc + + + movsd %xmm7,%xmm10 + addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos + subpd %xmm11,%xmm5 # -(-t) + + jmp .L__vrs4_sinf_cleanup +.align 16 +.Lcossin_coscos_piby4: + movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos + movapd %xmm2,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd %xmm3,%xmm6 # lower x2 for x3 for sin + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4 + movapd .Lcosarray+0x10(%rip),%xmm8 # cs2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # c2 + + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1 + + movapd %xmm13,%xmm3 # upper=x4 + movsd %xmm6,%xmm3 # lower x2 + mulsd %xmm1,%xmm3 # lower x2*x + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm12,%xmm4 # x4 * zc + mulpd 
%xmm3,%xmm5 # upper= x4 * zc + # lower=x3 * zs + + movlhps %xmm7,%xmm1 + addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos + subpd %xmm11,%xmm4 # -(-t) + + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lcossin_sinsin_piby4: # Derived from sincos_coscos + + movhlps %xmm3,%xmm0 # x2 + movapd %xmm3,%xmm7 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + movapd %xmm13,%xmm3 # upper x4 for cos + movsd %xmm7,%xmm3 # lower x2 for sin + mulsd %xmm1,%xmm3 # lower x3=x2*x for sin + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # upper=x4 * zc + # lower=x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm1,%xmm5 # +t upper, +x lower + + + jmp .L__vrs4_sinf_cleanup +.align 16 +.Lsincos_coscos_piby4: + movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos + movapd %xmm2,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4 + movapd .Lcosarray+0x10(%rip),%xmm8 # cs2 + movapd .Lcossinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1 + + mulpd %xmm1,%xmm3 # upper=x3 for sin + mulsd %xmm1,%xmm3 # lower=x4 for cos + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm12,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # lower= x4 * zc + # upper= x3 * zs + + movsd %xmm7,%xmm1 + subpd %xmm11,%xmm4 # -(-t) + addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos + + + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lsincos_sinsin_piby4: # Derived from sincos_coscos + + movsd %xmm3,%xmm0 # x2 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lcossinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd 
%xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm1,%xmm6 + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm6,%xmm11 # upper =t ; lower =x + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm11,%xmm5 # +t lower, +x upper + + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lsinsin_cossin_piby4: # Derived from sincos_coscos + + movhlps %xmm2,%xmm0 # x2 + movapd %xmm2,%xmm7 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + movapd %xmm12,%xmm2 # upper x4 for cos + movsd %xmm7,%xmm2 # lower x2 for sin + mulsd %xmm10,%xmm2 # lower x3=x2*x for sin + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # upper=x4 * zc + # lower=x3 * zs + + addpd %xmm1,%xmm5 # +x + addpd %xmm10,%xmm4 # +t upper, +x lower + + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lsinsin_sincos_piby4: # Derived from sincos_coscos + + movsd %xmm2,%xmm0 # x2 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + movapd .Lcossinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm10,%xmm2 # upper x3 for sin + mulsd %xmm10,%xmm2 # lower x4 for cos + + movhlps %xmm10,%xmm6 + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm6,%xmm11 + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + addpd %xmm1,%xmm5 # +x + addpd %xmm11,%xmm4 # +t lower, +x upper + + jmp .L__vrs4_sinf_cleanup + +.align 16 +.Lsinsin_sinsin_piby4: 
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + #x2 = x * x; + #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)))); + + #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4)); + + + movapd %xmm2,%xmm0 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + + mulpd %xmm10,%xmm2 # x3 + mulpd %xmm1,%xmm3 # x3 + + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm0,%xmm4 # x4(c3+x2c4) + mulpd %xmm11,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + + jmp .L__vrs4_sinf_cleanup
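+
+# For reference, every kernel above reduces to one of the two scalar
+# evaluations below (coefficients as listed at .Lsinarray/.Lcosarray; an
+# illustrative sketch with x2 = r*r, x3 = x2*r, x4 = x2*x2):
+#     sin(r) ~= r + x3*((s1 + x2*s2) + x4*(s3 + x2*s4));
+#     cos(r) ~= (1.0 - 0.5*x2) + x4*((c1 + x2*c2) + x4*(c3 + x2*c4));
+# splitting each sum this way lets the two halves overlap in the pipeline.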
diff --git a/src/gas/vrs8expf.S b/src/gas/vrs8expf.S new file mode 100644 index 0000000..b2eb597 --- /dev/null +++ b/src/gas/vrs8expf.S
@@ -0,0 +1,618 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8expf.s
+#
+# A vector implementation of the expf libm function.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_expf(__m128 x1, __m128 x2);
+#
+# Computes e raised to the x power for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_ux,0x00 #qword
+.equ p_ux2,0x010 #qword
+
+.equ save_xa,0x020 #qword
+.equ save_ya,0x028 #qword
+.equ save_nv,0x030 #qword
+
+
+.equ p_iter,0x038 # qword storage for number of loop iterations
+
+.equ p_j,0x040 # second temporary for get/put bits operation
+.equ p_m,0x050 #qword
+.equ p_j2,0x060 # second temporary for exponent multiply
+.equ p_m2,0x070 #qword
+.equ save_rbx,0x080 #qword
+
+
+.equ stack_size,0x098
+
+
+# parameters passed by gcc as:
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrs8_expf
+ .type __vrs8_expf,@function
+__vrs8_expf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+
+# Process the array 8 values at a time.
+
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+
+ movaps %xmm0,p_ux(%rsp)
+ maxps .L__real_m8192(%rip),%xmm0
+ movaps %xmm1,p_ux2(%rsp)
+ maxps .L__real_m8192(%rip),%xmm1
+ movaps %xmm1,%xmm6
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 #
+
+ mulps %xmm0,%xmm2
+ xor %rax,%rax
+ minps .L__real_8192(%rip),%xmm2
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm5 #
+
+ mulps %xmm6,%xmm5
+ minps .L__real_8192(%rip),%xmm5 # protect against large input values
+
+# /* Set n = nearest integer to r */
+ cvtps2dq %xmm2,%xmm3
+ lea .L__two_to_jby32_table(%rip),%rdi
+ cvtdq2ps %xmm3,%xmm1
+
+ cvtps2dq %xmm5,%xmm8
+ cvtdq2ps %xmm8,%xmm7
+# r1 = x - n * logbaseof2_by_32_lead;
+ movaps .L__real_log2_by_32_head(%rip),%xmm2
+ mulps %xmm1,%xmm2
+ subps %xmm2,%xmm0 # r1 in xmm0,
+
+ movaps .L__real_log2_by_32_head(%rip),%xmm5
+ mulps %xmm7,%xmm5
+ subps %xmm5,%xmm6 # r1 in xmm6,
+
+
+# r2 = - n * logbaseof2_by_32_tail;
+ mulps .L__real_log2_by_32_tail(%rip),%xmm1
+ mulps .L__real_log2_by_32_tail(%rip),%xmm7
+
+# j = n & 0x0000001f;
+ movdqa %xmm3,%xmm4
+ movdqa .L__int_mask_1f(%rip),%xmm2
+ movdqa %xmm8,%xmm9
+ movdqa .L__int_mask_1f(%rip),%xmm5
+ pand %xmm4,%xmm2
+ movdqa %xmm2,p_j(%rsp)
+# f1 = two_to_jby32_lead_table[j];
+
+ pand %xmm9,%xmm5
+ movdqa %xmm5,p_j2(%rsp)
+
+# *m = (n - j) / 32;
+ psubd %xmm2,%xmm4
+ psrad $5,%xmm4
+ movdqa %xmm4,p_m(%rsp)
+ psubd %xmm5,%xmm9
+ psrad $5,%xmm9
+ movdqa %xmm9,p_m2(%rsp)
+
+ movaps %xmm0,%xmm3
+ addps %xmm1,%xmm3 # r = r1+ r2
+
+ mov p_j(%rsp),%eax # get an individual index
+
+ movaps %xmm6,%xmm8
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ addps %xmm7,%xmm8 # r = r1+ r2
+ mov %edx,p_j(%rsp) # save the f1 value
+
+
+# Step 2. Compute the polynomial.
+# q = r1 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 )))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ movaps %xmm3,%xmm4
+ movaps %xmm3,%xmm2
+ mulps %xmm2,%xmm2 # x*x
+ mulps .L__real_1_24(%rip),%xmm4 # /24
+
+ mov p_j+4(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+4(%rsp) # save the f1 value
+
+
+ addps .L__real_1_6(%rip),%xmm4 # +1/6
+
+ mulps %xmm2,%xmm3 # x^3
+ mov p_j+8(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+8(%rsp) # save the f1 value
+
+ mulps .L__real_half(%rip),%xmm2 # x^2/2
+ mov p_j+12(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+12(%rsp) # save the f1 value
+
+ mulps %xmm3,%xmm4 # *x^3
+ mov p_j2(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2(%rsp) # save the f1 value
+
+
+ addps %xmm4,%xmm1 # +r2
+
+ addps %xmm2,%xmm1 # + x^2/2
+ addps %xmm1,%xmm0 # +r1
+
+ movaps %xmm8,%xmm9
+ mov p_j2+4(%rsp),%eax # get an individual index
+ movaps %xmm8,%xmm5
+ mulps %xmm5,%xmm5 # x*x
+ mulps .L__real_1_24(%rip),%xmm9 # /24
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+4(%rsp) # save the f1 value
+
+
+# deal with infinite or denormal results
+ movdqa p_m(%rsp),%xmm1
+ movdqa p_m(%rsp),%xmm2
+ pcmpgtd .L__int_127(%rip),%xmm2
+ pminsw .L__int_128(%rip),%xmm1 # ceil at 128
+ movmskps %xmm2,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm1 # add bias
+
+# *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j(%rsp),%xmm0 # * f1
+ addps p_j(%rsp),%xmm0 # + f1
+ jnz .L__exp_largef
+.L__check1:
+
+ pxor %xmm2,%xmm2 # floor at 0
+ pmaxsw %xmm2,%xmm1
+
+
pslld $23,%xmm1 # build 2^n + + movaps %xmm1,%xmm2 + + + +# check for infinity or nan + movaps p_ux(%rsp),%xmm1 + andps .L__real_infinity(%rip),%xmm1 + cmpps $0,.L__real_infinity(%rip),%xmm1 + movmskps %xmm1,%ebx + test $0x0f,%ebx + +# end of splitexp +# /* Scale (z1 + z2) by 2.0**m */ +# Step 3. Reconstitute. + + mulps %xmm2,%xmm0 # result *= 2^n + +# we'd like to avoid a branch, and can use cmp's and and's to +# eliminate them. But it adds cycles for normal cases +# to handle events that are supposed to be exceptions. +# Using this branch with the +# check above results in faster code for the normal cases. +# And branch mispredict penalties should only come into +# play for nans and infinities. + jnz .L__exp_naninf +.L__vsa_bottom1: + + # q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision + addps .L__real_1_6(%rip),%xmm9 # +1/6 + + mulps %xmm5,%xmm8 # x^3 + mov p_j2+8(%rsp),%eax # get an individual index + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j2+8(%rsp) # save the f1 value + + mulps .L__real_half(%rip),%xmm5 # x^2/2 + mulps %xmm8,%xmm9 # *x^3 + + mov p_j2+12(%rsp),%eax # get an individual index + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j2+12(%rsp) # save the f1 value + + addps %xmm9,%xmm7 # +r2 + + addps %xmm5,%xmm7 # + x^2/2 + addps %xmm7,%xmm6 # +r1 + + + # deal with infinite or denormal results + movdqa p_m2(%rsp),%xmm7 + movdqa p_m2(%rsp),%xmm5 + pcmpgtd .L__int_127(%rip),%xmm5 + pminsw .L__int_128(%rip),%xmm7 # ceil at 128 + movmskps %xmm5,%eax + test $0x0f,%eax + + paddd .L__int_127(%rip),%xmm7 # add bias + + # *z2 = f2 + ((f1 + f2) * q); + mulps p_j2(%rsp),%xmm6 # * f1 + addps p_j2(%rsp),%xmm6 # + f1 + jnz .L__exp_largef2 +.L__check2: + pxor %xmm1,%xmm1 # floor at 0 + pmaxsw %xmm1,%xmm7 + + pslld $23,%xmm7 # build 2^n + + movaps %xmm7,%xmm1 + + + # check for infinity or nan + movaps p_ux2(%rsp),%xmm7 + andps .L__real_infinity(%rip),%xmm7 + cmpps $0,.L__real_infinity(%rip),%xmm7 + movmskps %xmm7,%ebx + test $0x0f,%ebx + + + # end of splitexp + # /* Scale (z1 + z2) by 2.0**m */ + # Step 3. Reconstitute. + + mulps %xmm6,%xmm1 # result *= 2^n + + jnz .L__exp_naninf2 + +.L__vsa_bottom2: + + + +# +.L__final_check: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + +# at least one of the numbers needs special treatment +.L__exp_naninf: + lea p_ux(%rsp),%rcx + lea p_j(%rsp),%rsi + call .L__fexp_naninf + jmp .L__vsa_bottom1 +.L__exp_naninf2: + lea p_ux2(%rsp),%rcx + lea p_j(%rsp),%rsi + movaps %xmm0,%xmm2 + movaps %xmm1,%xmm0 + call .L__fexp_naninf + movaps %xmm0,%xmm1 + movaps %xmm2,%xmm0 + jmp .L__vsa_bottom2 + +# deal with nans and infinities +# This subroutine checks a packed single for nans and infinities and +# produces the proper result from the exceptional inputs +# Register assumptions: +# Inputs: +# rbx - mask of errors +# xmm0 - computed result vector +# Outputs: +# xmm0 - new result vector +# %rax,rdx,rbx,%xmm2 all modified. + +.L__fexp_naninf: + sub $0x018,%rsp + movaps %xmm0,(%rsi) # save the computed values + test $1,%ebx # first value? + jz .L__Lni2 + mov 0(%rcx),%edx # get the input + call .L__naninf + mov %edx,0(%rsi) # copy the result +.L__Lni2: + test $2,%ebx # second value? + jz .L__Lni3 + mov 4(%rcx),%edx # get the input + call .L__naninf + mov %edx,4(%rsi) # copy the result +.L__Lni3: + test $4,%ebx # third value? + jz .L__Lni4 + mov 8(%rcx),%edx # get the input + call .L__naninf + mov %edx,8(%rsi) # copy the result +.L__Lni4: + test $8,%ebx # fourth value? 
+ jz .L__Lnie + mov 12(%rcx),%edx # get the input + call .L__naninf + mov %edx,12(%rsi) # copy the result +.L__Lnie: + movaps (%rsi),%xmm0 # get the answers + add $0x018,%rsp + ret + +# +# a simple subroutine to check a scalar input value for infinity +# or NaN and return the correct result +# expects input in .Land,%edx returns value in edx. Destroys eax. +.L__naninf: + mov $0x0007FFFFF,%eax + test %eax,%edx + jnz .L__enan # jump if mantissa not zero, so it's a NaN +# inf + mov %edx,%eax + rcl $1,%eax + jnc .L__r # exp(+inf) = inf + xor %edx,%edx # exp(-inf) = 0 + jmp .L__r + +#NaN +.L__enan: + mov $0x000400000,%eax # convert to quiet + or %eax,%edx +.L__r: + ret + .align 16 +# deal with m > 127. In some instances, rounding during calculations +# can result in infinity when it shouldn't. For these cases, we scale +# m down, and scale the mantissa up. + +.L__exp_largef: + movdqa %xmm0,p_j(%rsp) # save the mantissa portion + movdqa %xmm1,p_m(%rsp) # save the exponent portion + mov %eax,%ecx # save the error mask + test $1,%ecx # first value? + jz .L__Lf2 + mov p_m(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m(%rsp) # save the exponent + movss p_j(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j(%rsp) # save the mantissa +.L__Lf2: + test $2,%ecx # second value? + jz .L__Lf3 + mov p_m+4(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+4(%rsp) # save the exponent + movss p_j+4(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+4(%rsp) # save the mantissa +.L__Lf3: + test $4,%ecx # third value? + jz .L__Lf4 + mov p_m+8(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+8(%rsp) # save the exponent + movss p_j+8(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+8(%rsp) # save the mantissa +.L__Lf4: + test $8,%ecx # fourth value? + jz .L__Lfe + mov p_m+12(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+12(%rsp) # save the exponent + movss p_j+12(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+12(%rsp) # save the mantissa +.L__Lfe: + movaps p_j(%rsp),%xmm0 # restore the mantissa portion back + movdqa p_m(%rsp),%xmm1 # restore the exponent portion + jmp .L__check1 + + .align 16 + +.L__exp_largef2: + movdqa %xmm6,p_j(%rsp) # save the mantissa portion + movdqa %xmm7,p_m2(%rsp) # save the exponent portion + mov %eax,%ecx # save the error mask + test $1,%ecx # first value? + jz .L__Lf22 + mov p_m2+0(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m2+0(%rsp) # save the exponent + movss p_j+0(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+0(%rsp) # save the mantissa +.L__Lf22: + test $2,%ecx # second value? + jz .L__Lf32 + mov p_m2+4(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m2+4(%rsp) # save the exponent + movss p_j+4(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+4(%rsp) # save the mantissa +.L__Lf32: + test $4,%ecx # third value? + jz .L__Lf42 + mov p_m2+8(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m2+8(%rsp) # save the exponent + movss p_j+8(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+8(%rsp) # save the mantissa +.L__Lf42: + test $8,%ecx # fourth value? 
+ jz .L__Lfe2 + mov p_m2+12(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m2+12(%rsp) # save the exponent + movss p_j+12(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+12(%rsp) # save the mantissa +.L__Lfe2: + movaps p_j(%rsp),%xmm6 # restore the mantissa portion back + movdqa p_m2(%rsp),%xmm7 # restore the exponent portion + jmp .L__check2 + + .data # MUCH better performance without this on my tests + .align 64 +.L__real_half: .long 0x03f000000 # 1/2 + .long 0x03f000000 + .long 0x03f000000 + .long 0x03f000000 +.L__real_two: .long 0x40000000 # 2 + .long 0x40000000 + .long 0x40000000 + .long 0x40000000 + +.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers + .long 0x46000000 + .long 0x46000000 + .long 0x46000000 +.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers + .long 0xC6000000 + .long 0xC6000000 + .long 0xC6000000 +.L__real_thirtytwo_by_log2: .long 0x04238AA3B # thirtytwo_by_log2 + .long 0x04238AA3B + .long 0x04238AA3B + .long 0x04238AA3B +.L__real_log2_by_32: .long 0x03CB17218 # log2_by_32 + .long 0x03CB17218 + .long 0x03CB17218 + .long 0x03CB17218 +.L__real_log2_by_32_head: .long 0x03CB17000 # log2_by_32 + .long 0x03CB17000 + .long 0x03CB17000 + .long 0x03CB17000 +.L__real_log2_by_32_tail: .long 0x0B585FDF4 # log2_by_32 + .long 0x0B585FDF4 + .long 0x0B585FDF4 + .long 0x0B585FDF4 +.L__real_1_6: .long 0x03E2AAAAB # 0.16666666666 used in polynomial + .long 0x03E2AAAAB + .long 0x03E2AAAAB + .long 0x03E2AAAAB +.L__real_1_24: .long 0x03D2AAAAB # 0.041666668 used in polynomial + .long 0x03D2AAAAB + .long 0x03D2AAAAB + .long 0x03D2AAAAB +.L__real_1_120: .long 0x03C088889 # 0.0083333338 used in polynomial + .long 0x03C088889 + .long 0x03C088889 + .long 0x03C088889 +.L__real_infinity: .long 0x07f800000 # infinity + .long 0x07f800000 + .long 0x07f800000 + .long 0x07f800000 +.L__int_mask_1f: .long 0x00000001f + .long 0x00000001f + .long 0x00000001f + .long 0x00000001f +.L__int_128: .long 0x000000080 + .long 0x000000080 + .long 0x000000080 + .long 0x000000080 +.L__int_127: .long 0x00000007f + .long 0x00000007f + .long 0x00000007f + .long 0x00000007f + +.L__two_to_jby32_table: + .long 0x03F800000 # 1.0000000000000000 + .long 0x03F82CD87 # 1.0218971486541166 + .long 0x03F85AAC3 # 1.0442737824274138 + .long 0x03F88980F # 1.0671404006768237 + .long 0x03F8B95C2 # 1.0905077326652577 + .long 0x03F8EA43A # 1.1143867425958924 + .long 0x03F91C3D3 # 1.1387886347566916 + .long 0x03F94F4F0 # 1.1637248587775775 + .long 0x03F9837F0 # 1.1892071150027210 + .long 0x03F9B8D3A # 1.2152473599804690 + .long 0x03F9EF532 # 1.2418578120734840 + .long 0x03FA27043 # 1.2690509571917332 + .long 0x03FA5FED7 # 1.2968395546510096 + .long 0x03FA9A15B # 1.3252366431597413 + .long 0x03FAD583F # 1.3542555469368927 + .long 0x03FB123F6 # 1.3839098819638320 + .long 0x03FB504F3 # 1.4142135623730951 + .long 0x03FB8FBAF # 1.4451808069770467 + .long 0x03FBD08A4 # 1.4768261459394993 + .long 0x03FC12C4D # 1.5091644275934228 + .long 0x03FC5672A # 1.5422108254079407 + .long 0x03FC9B9BE # 1.5759808451078865 + .long 0x03FCE248C # 1.6104903319492543 + .long 0x03FD2A81E # 1.6457554781539649 + .long 0x03FD744FD # 1.6817928305074290 + .long 0x03FDBFBB8 # 1.7186192981224779 + .long 0x03FE0CCDF # 1.7562521603732995 + .long 0x03FE5B907 # 1.7947090750031072 + .long 0x03FEAC0C7 # 1.8340080864093424 + .long 0x03FEFE4BA # 1.8741676341103000 + .long 0x03FF5257D # 1.9152065613971474 + .long 0x03FFA83B3 # 1.9571441241754002 + 
.long 0 # for alignment +
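Steps 1–3 of __vrs8_expf above implement the usual table-driven reduction exp(x) = 2^m * 2^(j/32) * exp(r) with |r| <= ln(2)/64. The single-lane C sketch below mirrors that arithmetic under some stated assumptions: the hypothetical two_to_jby32[] holds the 32 .L__two_to_jby32_table values, the decimal constants are decodings of the .L__real_* bit patterns above, and nearbyintf() matches cvtps2dq in the default round-to-nearest mode. The +/-8192 clamp and the .L__naninf/.L__exp_largef fixups are omitted:

    #include <math.h>

    /* Hypothetical single-lane sketch of the reduction used by __vrs8_expf. */
    float expf_sketch(float x, const float two_to_jby32[32])
    {
        const float thirtytwo_by_log2 = 46.166241f;           /* 32/ln(2), 0x4238AA3B     */
        const float log2_by_32_head   = 0.02165985107421875f; /* 0x3CB17000               */
        const float log2_by_32_tail   = 9.9831822e-07f;       /* ln(2)/32 - head (approx) */

        int n = (int)nearbyintf(x * thirtytwo_by_log2);       /* n = nearest int to r     */
        int j = n & 0x1f;                                     /* table index              */
        int m = (n - j) / 32;                                 /* power-of-two exponent    */

        /* r = r1 + r2, computed in two pieces so r1 is (nearly) exact */
        float r = (x - (float)n * log2_by_32_head) - (float)n * log2_by_32_tail;

        /* q = r + r^2/2 + r^3/6 + r^4/24 -- good enough for single precision */
        float q = r + r * r * (0.5f + r * (1.0f / 6.0f + r * (1.0f / 24.0f)));

        float z = two_to_jby32[j] + two_to_jby32[j] * q;      /* f1 + f1*q                */
        return ldexpf(z, m);                                  /* scale by 2**m            */
    }

ldexpf() stands in for the pslld-built 2^m factor, so unlike the vector code this sketch does not truncate denormal results to zero or special-case overflow.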
diff --git a/src/gas/vrs8log10f.S b/src/gas/vrs8log10f.S new file mode 100644 index 0000000..b0a2a67 --- /dev/null +++ b/src/gas/vrs8log10f.S
@@ -0,0 +1,967 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8log10f.s
+#
+# A vector implementation of the log10f libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_log10f(__m128 x1, __m128 x2);
+#
+# Computes the base-10 logarithm of x for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+# +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + .text + .align 16 + .p2align 4,,15 + +# define local variable storage offsets +.equ p_x,0 # save x +.equ p_idx,0x010 # xmmword index +.equ p_z1,0x020 # xmmword index +.equ p_q,0x030 # xmmword index +.equ p_corr,0x040 # xmmword index +.equ p_omask,0x050 # xmmword index +.equ save_xmm6,0x060 # +.equ save_rbx,0x070 # +.equ save_xmm7,0x080 # +.equ save_xmm8,0x090 # +.equ save_xmm9,0x0a0 # +.equ save_xmm10,0x0b0 # +.equ save_xmm11,0x0c0 # +.equ save_xmm12,0x0d0 # +.equ save_xmm13,0x0d0 # +.equ p_x2,0x0100 # save x +.equ p_idx2,0x0110 # xmmword index +.equ p_z12,0x0120 # xmmword index +.equ p_q2,0x0130 # xmmword index + +.equ stack_size,0x0168 + + + +.globl __vrs8_log10f + .type __vrs8_log10f,@function +__vrs8_log10f: + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx + +# check e as a special case + movdqa %xmm0,p_x(%rsp) # save x + movdqa %xmm1,p_x2(%rsp) # save x +# movdqa %xmm0,%xmm2 +# cmpps $0,.L__real_ef(%rip),%xmm2 +# movmskps %xmm2,%r9d + + movdqa %xmm1,%xmm12 + movdqa %xmm1,%xmm9 + movaps %xmm1,%xmm7 + +# +# compute the index into the log tables +# + movdqa %xmm0,%xmm3 + movaps %xmm0,%xmm1 + psrld $23,%xmm3 + + # + # compute the index into the log tables + # + psrld $23,%xmm9 + subps .L__real_one(%rip),%xmm7 + psubd .L__mask_127(%rip),%xmm9 + subps .L__real_one(%rip),%xmm1 + psubd .L__mask_127(%rip),%xmm3 + cvtdq2ps %xmm9,%xmm13 # xexp + + movdqa %xmm12,%xmm9 + pand .L__real_mant(%rip),%xmm9 + xor %r8,%r8 + movdqa %xmm9,%xmm8 + movaps .L__real_half(%rip),%xmm11 # .5 + cvtdq2ps %xmm3,%xmm6 # xexp + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + xor %r8,%r8 + movdqa %xmm3,%xmm2 + movaps .L__real_half(%rip),%xmm5 # .5 + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $16,%xmm9 + movdqa %xmm9,%xmm10 + psrld $1,%xmm9 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + #/* Now x = 2**xexp * f, 1/2 <= f < 1. 
*/ + paddd .L__mask_040(%rip),%xmm9 + pand .L__mask_001(%rip),%xmm10 + paddd %xmm10,%xmm9 + cvtdq2ps %xmm9,%xmm7 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + packssdw %xmm9,%xmm9 + movq %xmm9,p_idx2(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + # reduce and get u + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + movdqa %xmm12,%xmm9 + orps .L__real_half(%rip),%xmm8 + + + mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128 + subps %xmm7,%xmm8 # f2 = f - f1 + mulps %xmm8,%xmm11 + addps %xmm11,%xmm7 + + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + divps %xmm7,%xmm8 # u + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12+8(%rsp) # save the f1 value +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + + lea .L__np_ln_tail_table(%rip),%rdx + addps %xmm4,%xmm1 # poly + +# recombine + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f2 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f2 value + + addps p_q(%rsp),%xmm1 #z2 +=q + + movaps p_z1(%rsp),%xmm0 # z1 values + + mulps %xmm6,%xmm2 + addps %xmm2,%xmm0 #r1 + movaps %xmm0,%xmm2 + mulps .L__real_log2_tail(%rip),%xmm6 + addps %xmm6,%xmm1 #r2 + movaps %xmm1,%xmm3 + +# logef to log10f + mulps .L__real_log10e_tail(%rip),%xmm1 + mulps .L__real_log10e_tail(%rip),%xmm0 + mulps .L__real_log10e_lead(%rip),%xmm3 + mulps .L__real_log10e_lead(%rip),%xmm2 + addps %xmm1,%xmm0 + addps %xmm3,%xmm0 + addps %xmm2,%xmm0 +# addps %xmm1,%xmm0 + + + +# check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm1,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg + +.L__f2: +## if +inf + movaps p_x(%rsp),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__log_inf +.L__f3: + + movaps p_x(%rsp),%xmm3 + subps .L__real_one(%rip),%xmm3 + andps .L__real_notsign(%rip),%xmm3 + cmpps $2,.L__real_threshold(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__near_one +.L__f4: + +# finish the second set of calculations + + # solve for ln(1+u) + movaps %xmm8,%xmm7 # u + mulps %xmm8,%xmm8 # u^2 + movaps %xmm8,%xmm11 + + movaps .L__real_cb3(%rip),%xmm9 + mulps %xmm8,%xmm9 #Cu2 + mulps %xmm7,%xmm11 # u^3 + addps .L__real_cb2(%rip),%xmm9 #B+Cu2 + movaps %xmm8,%xmm10 + mulps %xmm11,%xmm10 # u^5 + movaps .L__real_log2_lead(%rip),%xmm8 + + mulps .L__real_cb1(%rip),%xmm11 #Au3 + addps %xmm11,%xmm7 # u+Au3 + mulps %xmm9,%xmm10 # u5(B+Cu2) + addps %xmm10,%xmm7 # poly + + + # recombine + lea .L__np_ln_tail_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2(%rsp) # save the f2 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2+8(%rsp) # save the f2 value + addps p_q2(%rsp),%xmm7 #z2 +=q + movaps p_z12(%rsp),%xmm1 # z1 values + + mulps %xmm13,%xmm8 + addps %xmm8,%xmm1 #r1 + movaps %xmm1,%xmm8 + mulps .L__real_log2_tail(%rip),%xmm13 + addps %xmm13,%xmm7 #r2 + movaps %xmm7,%xmm9 + + # logef to log10f + mulps .L__real_log10e_tail(%rip),%xmm7 + mulps .L__real_log10e_tail(%rip),%xmm1 + mulps .L__real_log10e_lead(%rip),%xmm9 + mulps .L__real_log10e_lead(%rip),%xmm8 + addps %xmm7,%xmm1 + addps %xmm9,%xmm1 + addps %xmm8,%xmm1 + +# addps %xmm7,%xmm1 + + # check e as a special case +# movaps p_x2(%rsp),%xmm10 +# cmpps $0,.L__real_ef(%rip),%xmm10 +# movmskps %xmm10,%r9d + # check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e2 +.L__f12: + + # check for negative numbers or zero + xorps %xmm7,%xmm7 + cmpps $1,p_x2(%rsp),%xmm7 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm7,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg2 + +.L__f22: + ## if +inf + movaps p_x2(%rsp),%xmm9 + cmpps $0,.L__real_inf(%rip),%xmm9 + movmskps %xmm9,%r9d + test $0x0f,%r9d + jnz .L__log_inf2 +.L__f32: + + movaps p_x2(%rsp),%xmm9 + subps .L__real_one(%rip),%xmm9 + andps .L__real_notsign(%rip),%xmm9 + cmpps $2,.L__real_threshold(%rip),%xmm9 + movmskps %xmm9,%r9d + test $0x0f,%r9d + jnz .L__near_one2 +.L__f42: + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + +.L__vlogf_e: + movdqa p_x(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm0,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + jmp .L__f1 + +.L__vlogf_e2: + movdqa p_x2(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm1,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + jmp .L__f12 + + .align 16 +.L__near_one: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm3,p_omask(%rsp) # save ones mask + movaps p_x(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r +# u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm1 + divps %xmm2,%xmm1 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm1,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction +# u = u + u; + addps %xmm1,%xmm1 #u + movaps %xmm1,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm1,%xmm5 # Cu + movaps %xmm1,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm1 + mulps %xmm1,%xmm1 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm1 #u6(Cu+Du3) + addps %xmm1,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + +# loge to log10 + movaps %xmm3,%xmm5 #r1=r + pand .L__mask_lower(%rip),%xmm5 + subps %xmm5,%xmm3 + addps %xmm3,%xmm2 #r2 = r2 + (r-r1) + + movaps %xmm5,%xmm3 + movaps %xmm2,%xmm1 + + mulps .L__real_log10e_tail(%rip),%xmm2 + mulps .L__real_log10e_tail(%rip),%xmm3 + mulps .L__real_log10e_lead(%rip),%xmm1 + mulps .L__real_log10e_lead(%rip),%xmm5 + addps %xmm2,%xmm3 + addps %xmm1,%xmm3 + addps %xmm5,%xmm3 +# return r + r2; +# addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm0,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + + jmp .L__f4 + + + .align 16 +.L__near_one2: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm9,p_omask(%rsp) # save ones mask + movaps p_x2(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r + # u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm7 + divps %xmm2,%xmm7 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C + # correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm7,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction + # u = u + u; + addps %xmm7,%xmm7 #u + movaps %xmm7,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 + # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm7,%xmm5 # Cu + movaps %xmm7,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 
#Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm7 + mulps %xmm7,%xmm7 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm7 #u6(Cu+Du3) + addps %xmm7,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + + #loge to log10 + movaps %xmm3,%xmm5 #r1=r + pand .L__mask_lower(%rip),%xmm5 + subps %xmm5,%xmm3 + addps %xmm3,%xmm2 #r2 = r2 + (r-r1) + + movaps %xmm5,%xmm3 + movaps %xmm2,%xmm7 + + mulps .L__real_log10e_tail(%rip),%xmm2 + mulps .L__real_log10e_tail(%rip),%xmm3 + mulps .L__real_log10e_lead(%rip),%xmm7 + mulps .L__real_log10e_lead(%rip),%xmm5 + addps %xmm2,%xmm3 + addps %xmm7,%xmm3 + addps %xmm5,%xmm3 + + # return r + r2; +# addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm1,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + + jmp .L__f42 + +# we have a zero, a negative number, or both. +# the mask is already in .LNaNs,%xmm1 are also picked up here, along with -inf. +.L__z_or_neg: +# deal with negatives first + movdqa %xmm1,%xmm3 + andps %xmm0,%xmm3 # keep the non-error values + andnps .L__real_nan(%rip),%xmm1 # setup the nan values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace +# check for +/- 0 + xorps %xmm1,%xmm1 + cmpps $0,p_x(%rsp),%xmm1 # 0 ?. + movmskps %xmm1,%r9d + test $0x0f,%r9d + jz .L__zn2 + + movdqa %xmm1,%xmm3 + andnps %xmm0,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + +.L__zn2: +# check for NaNs + movaps p_x(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x(%rsp),%xmm1 # isolate the NaNs + pand %xmm4,%xmm1 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm1,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm1 + andnps %xmm0,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f2 + +# handle only +inf log(+inf) = inf +.L__log_inf: + movdqa %xmm3,%xmm1 + andnps %xmm0,%xmm3 # keep the non-error values + andps p_x(%rsp),%xmm1 # setup the +inf values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + jmp .L__f3 + + +.L__z_or_neg2: + # deal with negatives first + movdqa %xmm7,%xmm3 + andps %xmm1,%xmm3 # keep the non-error values + andnps .L__real_nan(%rip),%xmm7 # setup the nan values + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + # check for +/- 0 + xorps %xmm7,%xmm7 + cmpps $0,p_x2(%rsp),%xmm7 # 0 ?. 
+ movmskps %xmm7,%r9d + test $0x0f,%r9d + jz .L__zn22 + + movdqa %xmm7,%xmm3 + andnps %xmm1,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + +.L__zn22: + # check for NaNs + movaps p_x2(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x2(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x2(%rsp),%xmm7 # isolate the NaNs + pand %xmm4,%xmm7 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm7,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm7 + andnps %xmm1,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f22 + + # handle only +inf log(+inf) = inf +.L__log_inf2: + movdqa %xmm9,%xmm7 + andnps %xmm1,%xmm9 # keep the non-error values + andps p_x2(%rsp),%xmm7 # setup the +inf values + orps %xmm9,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + jmp .L__f32 + + + .data + .align 64 + +.L__real_zero: .quad 0x00000000000000000 # 1.0 + .quad 0x00000000000000000 +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 1.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantipsa bits + .quad 0x0007FFFFF007FFFFF +.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */ + .quad 0x03c0000003c000000 +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f +.L__mask_040: .quad 0x00000004000000040 # + .quad 0x00000004000000040 +.L__mask_001: .quad 0x00000000100000001 # + .quad 0x00000000100000001 + + +.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03 + .quad 0x03CF5C28F3CF5C28F + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03 + .quad 0x03B1249183B124918 +.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04 + .quad 0x039E401A639E401A6 +.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03 + .quad 0x03B124A123B124A12 +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 + +.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500 + .quad 0x03EDE00003EDE0000 
+.L__real_log10e_tail: .quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319 + .quad 0x03A37B1523A37B152 + + +.L__mask_lower: .quad 0x0ffff0000ffff0000 # + .quad 0x0ffff0000ffff0000 +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 
6.53301239013671875000e-01 + .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + .long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + 
.long 0x387BE3CD # 0.000060055219 16 + .long 0x3803B715 # 0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + .long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment + +
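To summarize the reduction used by __vrs8_log10f: x is split as 2^xexp * f with f in [0.5,1), f1 = index/128 is picked from the top mantissa bits, ln(2*f1) comes from the lead/tail tables above, and ln(f/f1) is approximated by a short odd polynomial in u = 2(f-f1)/(f+f1). The single-lane C sketch below collapses each lead/tail pair into one float (giving up the extra-precision trick) and skips the near-1 path and all error handling; the hypothetical np_ln_lead/np_ln_tail arrays are assumed to hold the table values above as floats:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical single-lane sketch of the table-driven log10f reduction.
       Assumes x is positive, finite, and normal. */
    float log10f_sketch(float x, const float np_ln_lead[65], const float np_ln_tail[65])
    {
        uint32_t ux;
        memcpy(&ux, &x, sizeof ux);                           /* bit pattern of x    */

        int      xexp = (int)(ux >> 23) - 127;                /* unbiased exponent   */
        uint32_t mant = ux & 0x007fffff;                      /* mantissa bits       */
        uint32_t top7 = mant >> 16;                           /* top 7 mantissa bits */

        int   index = 0x40 + (int)((top7 >> 1) + (top7 & 1)); /* rounded, 64..128    */
        float f     = 0.5f + (float)mant / 16777216.0f;       /* 0.5 <= f < 1        */
        float f1    = (float)index * 0.0078125f;              /* index/128           */
        float u     = (f - f1) / (f1 + 0.5f * (f - f1));      /* 2(f-f1)/(f+f1)      */

        /* ln(f/f1) ~= u + u^3*(cb1 + u^2*(cb2 + u^2*cb3)) */
        float u2   = u * u;
        float poly = u + u * u2 * (8.3333333e-02f
                          + u2 * (1.2500000e-02f + u2 * 2.2321981e-03f));

        /* ln(x) = xexp*ln(2) + ln(2*f1) + ln(f/f1); then scale to base 10 */
        float lnx = (float)xexp * 0.69314718f
                  + (np_ln_lead[index - 64] + np_ln_tail[index - 64])
                  + poly;
        return lnx * 0.43429448f;                             /* * log10(e)          */
    }

In the vector code the lead/tail pairs (.L__real_log2_lead/_tail, .L__real_log10e_lead/_tail) keep each product exactly representable so the final sum effectively rounds once; the sketch trades that away for brevity.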
diff --git a/src/gas/vrs8log2f.S b/src/gas/vrs8log2f.S new file mode 100644 index 0000000..d1028b0 --- /dev/null +++ b/src/gas/vrs8log2f.S
@@ -0,0 +1,956 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8log2f.s
+#
+# A vector implementation of the log2f libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_log2f(__m128 x1, __m128 x2);
+#
+# Computes the base-2 logarithm of x for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+# +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + .text + .align 16 + .p2align 4,,15 + +# define local variable storage offsets +.equ p_x,0 # save x +.equ p_idx,0x010 # xmmword index +.equ p_z1,0x020 # xmmword index +.equ p_q,0x030 # xmmword index +.equ p_corr,0x040 # xmmword index +.equ p_omask,0x050 # xmmword index +.equ save_xmm6,0x060 # +.equ save_rbx,0x070 # +.equ save_xmm7,0x080 # +.equ save_xmm8,0x090 # +.equ save_xmm9,0x0a0 # +.equ save_xmm10,0x0b0 # +.equ save_xmm11,0x0c0 # +.equ save_xmm12,0x0d0 # +.equ save_xmm13,0x0d0 # +.equ p_x2,0x0100 # save x +.equ p_idx2,0x0110 # xmmword index +.equ p_z12,0x0120 # xmmword index +.equ p_q2,0x0130 # xmmword index + +.equ stack_size,0x0168 + + + +.globl __vrs8_log2f + .type __vrs8_log2f,@function +__vrs8_log2f: + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx + +# check e as a special case + movdqa %xmm0,p_x(%rsp) # save x + movdqa %xmm1,p_x2(%rsp) # save x +# movdqa %xmm0,%xmm2 +# cmpps $0,.L__real_ef(%rip),%xmm2 +# movmskps %xmm2,%r9d + + movdqa %xmm1,%xmm12 + movdqa %xmm1,%xmm9 + movaps %xmm1,%xmm7 + +# +# compute the index into the log tables +# + movdqa %xmm0,%xmm3 + movaps %xmm0,%xmm1 + psrld $23,%xmm3 + + # + # compute the index into the log tables + # + psrld $23,%xmm9 + subps .L__real_one(%rip),%xmm7 + psubd .L__mask_127(%rip),%xmm9 + subps .L__real_one(%rip),%xmm1 + psubd .L__mask_127(%rip),%xmm3 + cvtdq2ps %xmm9,%xmm13 # xexp + + movdqa %xmm12,%xmm9 + pand .L__real_mant(%rip),%xmm9 + xor %r8,%r8 + movdqa %xmm9,%xmm8 + movaps .L__real_half(%rip),%xmm11 # .5 + cvtdq2ps %xmm3,%xmm6 # xexp + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + xor %r8,%r8 + movdqa %xmm3,%xmm2 + movaps .L__real_half(%rip),%xmm5 # .5 + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $16,%xmm9 + movdqa %xmm9,%xmm10 + psrld $1,%xmm9 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + #/* Now x = 2**xexp * f, 1/2 <= f < 1. 
*/ + paddd .L__mask_040(%rip),%xmm9 + pand .L__mask_001(%rip),%xmm10 + paddd %xmm10,%xmm9 + cvtdq2ps %xmm9,%xmm7 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + packssdw %xmm9,%xmm9 + movq %xmm9,p_idx2(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + # reduce and get u + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + movdqa %xmm12,%xmm9 + orps .L__real_half(%rip),%xmm8 + + + mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128 + subps %xmm7,%xmm8 # f2 = f - f1 + mulps %xmm8,%xmm11 + addps %xmm11,%xmm7 + + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + divps %xmm7,%xmm8 # u + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12+8(%rsp) # save the f1 value +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2e_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + movaps .L__real_log2e_tail(%rip),%xmm3 + lea .L__np_ln_tail_table(%rip),%rdx + addps %xmm4,%xmm1 # poly + +# recombine + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f2 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f2 value + + addps p_q(%rsp),%xmm1 #z2 +=q + movaps %xmm1,%xmm4 #z2 copy + movaps p_z1(%rsp),%xmm0 # z1 values + movaps %xmm0,%xmm5 #z1 copy + + mulps %xmm2,%xmm5 #z1*log2e_lead + mulps %xmm2,%xmm1 #z2*log2e_lead + mulps %xmm3,%xmm4 #z2*log2e_tail + mulps %xmm3,%xmm0 #z1*log2e_tail + addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp + addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail + addps %xmm1,%xmm0 #r2 +#return r1+r2 + addps %xmm5,%xmm0 # r1+ r2 + + + +# check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm1,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg + +.L__f2: +## if +inf + movaps p_x(%rsp),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__log_inf +.L__f3: + + movaps p_x(%rsp),%xmm3 + subps .L__real_one(%rip),%xmm3 + andps .L__real_notsign(%rip),%xmm3 + cmpps $2,.L__real_threshold(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__near_one +.L__f4: + +# finish the second set of calculations + + # solve for ln(1+u) + movaps %xmm8,%xmm7 # u + mulps %xmm8,%xmm8 # u^2 + movaps %xmm8,%xmm11 + + movaps .L__real_cb3(%rip),%xmm9 + mulps %xmm8,%xmm9 #Cu2 + mulps %xmm7,%xmm11 # u^3 + addps .L__real_cb2(%rip),%xmm9 #B+Cu2 + movaps %xmm8,%xmm10 + mulps %xmm11,%xmm10 # u^5 + movaps .L__real_log2e_lead(%rip),%xmm8 + + mulps .L__real_cb1(%rip),%xmm11 #Au3 + addps %xmm11,%xmm7 # u+Au3 + mulps %xmm9,%xmm10 # u5(B+Cu2) + movaps .L__real_log2e_tail(%rip),%xmm9 + addps %xmm10,%xmm7 # poly + + + # recombine + lea .L__np_ln_tail_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2(%rsp) # save the f2 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2+8(%rsp) # save the f2 value + addps p_q2(%rsp),%xmm7 #z2 +=q + movaps %xmm7,%xmm10 #z2 copy + movaps p_z12(%rsp),%xmm1 # z1 values + movaps %xmm1,%xmm11 #z1 copy + + mulps %xmm8,%xmm11 #z1*log2e_lead + mulps %xmm8,%xmm7 #z2*log2e_lead + mulps %xmm9,%xmm10 #z2*log2e_tail + mulps %xmm9,%xmm1 #z1*log2e_tail + addps %xmm13,%xmm11 #r1 = z1*log2e_lead + xexp + addps %xmm10,%xmm1 #z1*log2e_tail + z2*log2e_tail + addps %xmm7,%xmm1 #r2 + #return r1+r2 + addps %xmm11,%xmm1 # r1+ r2 + + # check e as a special case +# movaps p_x2(%rsp),%xmm10 +# cmpps $0,.L__real_ef(%rip),%xmm10 +# movmskps %xmm10,%r9d + # check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e2 +.L__f12: + + # check for negative numbers or zero + xorps %xmm7,%xmm7 + cmpps $1,p_x2(%rsp),%xmm7 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm7,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg2 + +.L__f22: + ## if +inf + movaps p_x2(%rsp),%xmm9 + cmpps $0,.L__real_inf(%rip),%xmm9 + movmskps %xmm9,%r9d + test $0x0f,%r9d + jnz .L__log_inf2 +.L__f32: + + movaps p_x2(%rsp),%xmm9 + subps .L__real_one(%rip),%xmm9 + andps .L__real_notsign(%rip),%xmm9 + cmpps $2,.L__real_threshold(%rip),%xmm9 + movmskps %xmm9,%r9d + test $0x0f,%r9d + jnz .L__near_one2 +.L__f42: + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + +.L__vlogf_e: + movdqa p_x(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm0,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + jmp .L__f1 + +.L__vlogf_e2: + movdqa p_x2(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm1,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + jmp .L__f12 + + .align 16 +.L__near_one: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm3,p_omask(%rsp) # save ones mask + movaps p_x(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r +# u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm1 + divps %xmm2,%xmm1 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm1,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction +# u = u + u; + addps %xmm1,%xmm1 #u + movaps %xmm1,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm1,%xmm5 # Cu + movaps %xmm1,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm1 + mulps %xmm1,%xmm1 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm1 #u6(Cu+Du3) + addps %xmm1,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + +# loge to log2 + movaps %xmm3,%xmm5 #r1=r + pand .L__mask_lower(%rip),%xmm5 + subps %xmm5,%xmm3 + addps %xmm3,%xmm2 #r2 = r2 + (r-r1) + + movaps %xmm5,%xmm3 + movaps %xmm2,%xmm1 + + mulps .L__real_log2e_tail(%rip),%xmm2 + mulps .L__real_log2e_tail(%rip),%xmm3 + mulps .L__real_log2e_lead(%rip),%xmm1 + mulps .L__real_log2e_lead(%rip),%xmm5 + addps %xmm2,%xmm3 + addps %xmm1,%xmm3 + addps %xmm5,%xmm3 + +# return r + r2; +# addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm0,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + + jmp .L__f4 + + + .align 16 +.L__near_one2: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm9,p_omask(%rsp) # save ones mask + movaps p_x2(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r + # u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm7 + divps %xmm2,%xmm7 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C + # correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm7,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction + # u = u + u; + addps %xmm7,%xmm7 #u + movaps %xmm7,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 + # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm7,%xmm5 # Cu + movaps %xmm7,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 
+ + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm7 + mulps %xmm7,%xmm7 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm7 #u6(Cu+Du3) + addps %xmm7,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + +# loge to log2 + movaps %xmm3,%xmm5 #r1=r + pand .L__mask_lower(%rip),%xmm5 + subps %xmm5,%xmm3 + addps %xmm3,%xmm2 #r2 = r2 + (r-r1) + + movaps %xmm5,%xmm3 + movaps %xmm2,%xmm7 + + mulps .L__real_log2e_tail(%rip),%xmm2 + mulps .L__real_log2e_tail(%rip),%xmm3 + mulps .L__real_log2e_lead(%rip),%xmm7 + mulps .L__real_log2e_lead(%rip),%xmm5 + addps %xmm2,%xmm3 + addps %xmm7,%xmm3 + addps %xmm5,%xmm3 + + # return r + r2; + # addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm1,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + + jmp .L__f42 + +# we have a zero, a negative number, or both. +# the mask is already in %xmm1. NaNs are also picked up here, along with -inf. +.L__z_or_neg: +# deal with negatives first + movdqa %xmm1,%xmm3 + andps %xmm0,%xmm3 # keep the non-error values + andnps .L__real_nan(%rip),%xmm1 # setup the nan values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace +# check for +/- 0 + xorps %xmm1,%xmm1 + cmpps $0,p_x(%rsp),%xmm1 # 0 ?. + movmskps %xmm1,%r9d + test $0x0f,%r9d + jz .L__zn2 + + movdqa %xmm1,%xmm3 + andnps %xmm0,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + +.L__zn2: +# check for NaNs + movaps p_x(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x(%rsp),%xmm1 # isolate the NaNs + pand %xmm4,%xmm1 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm1,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm1 + andnps %xmm0,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f2 + +# handle only +inf log(+inf) = inf +.L__log_inf: + movdqa %xmm3,%xmm1 + andnps %xmm0,%xmm3 # keep the non-error values + andps p_x(%rsp),%xmm1 # setup the +inf values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + jmp .L__f3 + + +.L__z_or_neg2: + # deal with negatives first + movdqa %xmm7,%xmm3 + andps %xmm1,%xmm3 # keep the non-error values + andnps .L__real_nan(%rip),%xmm7 # setup the nan values + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + # check for +/- 0 + xorps %xmm7,%xmm7 + cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d + test $0x0f,%r9d + jz .L__zn22 + + movdqa %xmm7,%xmm3 + andnps %xmm1,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + +.L__zn22: + # check for NaNs + movaps p_x2(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x2(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x2(%rsp),%xmm7 # isolate the NaNs + pand %xmm4,%xmm7 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm7,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm7 + andnps %xmm1,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f22 + + # handle only +inf log(+inf) = inf +.L__log_inf2: + movdqa %xmm9,%xmm7 + andnps %xmm1,%xmm9 # keep the non-error values + andps p_x2(%rsp),%xmm7 # setup the +inf values + orps %xmm9,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + jmp .L__f32 + + + .data + .align 64 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 2.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits + .quad 0x0007FFFFF007FFFFF +.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */ + .quad 0x03c0000003c000000 +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f +.L__mask_040: .quad 0x00000004000000040 # + .quad 0x00000004000000040 +.L__mask_001: .quad 0x00000000100000001 # + .quad 0x00000000100000001 + + +.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03 + .quad 0x03CF5C28F3CF5C28F + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03 + .quad 0x03B1249183B124918 +.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04 + .quad 0x039E401A639E401A6 +.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03 + .quad 0x03B124A123B124A12 +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 +.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000 + .quad 0x03FB800003FB80000 +.L__real_log2e_tail: .quad
0x03BAA3B293BAA3B29 # 0.0051950408889633 + .quad 0x03BAA3B293BAA3B29 + +.L__mask_lower: .quad 0x0ffff0000ffff0000 # + .quad 0x0ffff0000ffff0000 + +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01 + .quad 0x3FE52A2D265BC5AB # 
6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + .long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + .long 0x387BE3CD # 0.000060055219 16 + .long 0x3803B715 # 
0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + .long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment + +
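#
# The "loge to log2" step in the near-one paths above multiplies by log2(e)
# in two pieces so no precision is lost: log2(e) is stored as an exact lead
# part (1.4375, .L__real_log2e_lead) plus a small tail (.L__real_log2e_tail),
# and r is truncated to its high 16 bits via .L__mask_lower so that r1*lead
# is exact in single precision. A scalar C sketch of just that step
# (illustrative only; the helper name and scaffolding are ours, not part of
# this file):
#
#   #include <stdint.h>
#   #include <string.h>
#   static float loge_to_log2(float r, float r2) /* r ~ x-1, r2 = poly term */
#   {
#       const float lead = 1.4375f;              /* .L__real_log2e_lead */
#       const float tail = 0.0051950408889633f;  /* .L__real_log2e_tail */
#       uint32_t u; memcpy(&u, &r, sizeof u);
#       u &= 0xffff0000u;                        /* .L__mask_lower */
#       float r1; memcpy(&r1, &u, sizeof r1);    /* high bits: r1*lead exact */
#       r2 = r2 + (r - r1);                      /* fold dropped bits into r2 */
#       return ((r1*tail + r2*tail) + r2*lead) + r1*lead;
#   }
#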
diff --git a/src/gas/vrs8logf.S b/src/gas/vrs8logf.S new file mode 100644 index 0000000..a5e7ed9 --- /dev/null +++ b/src/gas/vrs8logf.S
@@ -0,0 +1,904 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrs8logf.s +# +# A vector implementation of the logf libm function. +# This routine is implemented in single precision. It is slightly +# less accurate than the double-precision version, but it is +# better suited to vectorization. +# +# Prototype: +# +# __m128,__m128 __vrs8_logf(__m128 x1, __m128 x2); +# +# Computes the natural log of x for eight packed single values. +# Places the results into xmm0 and xmm1. +# Returns proper C99 values, but may not raise status flags properly. +# Less than 1 ulp of error. +# +# This array version is basically an unrolling of the by4 scalar single +# routine. The second set of operations is performed by the indented +# instructions interleaved into the first set. +# The scheduling is done by trial and error. The resulting code represents +# the best time of many variations. It would seem more interleaving could +# be done, as there is a long stretch of the second computation that is not +# interleaved. But moving any of this code forward makes the routine +# slower.
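#
# For reference, each lane of the code below follows the classic
# table-driven log scheme. A scalar C sketch (illustrative only -- the
# table declarations and helper name are ours, and index extraction is
# shown with library calls rather than the integer tricks the assembly
# uses):
#
#   #include <stdint.h>
#   #include <math.h>
#   extern const float ln_lead_table[65], ln_tail_table[65]; /* .L__np_ln_* */
#   static float logf_sketch(float x)        /* x assumed positive, normal */
#   {
#       int   xexp  = ilogbf(x);             /* x = 2^xexp * m, 1 <= m < 2 */
#       float m     = ldexpf(x, -xexp);
#       int   index = (int)lrintf(m * 64.0f);       /* 64 <= index <= 128  */
#       float m1    = (float)index / 64.0f;  /* = 2*f1 with f1 = index/128 */
#       float u     = 2.0f*(m - m1)/(m + m1); /* = f2/(f1 + 0.5*f2) below  */
#       float v     = u*u;                   /* cb1..cb3 minimax coeffs:   */
#       float z2    = u + u*v*(0.0833333333f + v*(0.0125f + v*0.0022321981f));
#       z2 += ln_tail_table[index - 64];     /* low bits of ln(index/64)   */
#       float r1 = ln_lead_table[index - 64] + (float)xexp*0.693115234375f;
#       float r2 = z2 + (float)xexp*3.1946183e-05f; /* ln2 lead/tail split */
#       return r1 + r2;                      /* = ln(x)                    */
#   }
#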
+# +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + .text + .align 16 + .p2align 4,,15 + +# define local variable storage offsets +.equ p_x,0 # save x +.equ p_idx,0x010 # xmmword index +.equ p_z1,0x020 # xmmword z1 +.equ p_q,0x030 # xmmword q +.equ p_corr,0x040 # xmmword correction +.equ p_omask,0x050 # xmmword ones mask +.equ save_xmm6,0x060 # +.equ save_rbx,0x070 # +.equ save_xmm7,0x080 # +.equ save_xmm8,0x090 # +.equ save_xmm9,0x0a0 # +.equ save_xmm10,0x0b0 # +.equ save_xmm11,0x0c0 # +.equ save_xmm12,0x0d0 # +.equ save_xmm13,0x0e0 # +.equ p_x2,0x0100 # save x +.equ p_idx2,0x0110 # xmmword index +.equ p_z12,0x0120 # xmmword z1 +.equ p_q2,0x0130 # xmmword q + +.equ stack_size,0x0168 + + + +.globl __vrs8_logf + .type __vrs8_logf,@function +__vrs8_logf: + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx + +# check e as a special case + movdqa %xmm0,p_x(%rsp) # save x + movdqa %xmm1,p_x2(%rsp) # save x + movdqa %xmm0,%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movmskps %xmm2,%r9d + + movdqa %xmm1,%xmm12 + movdqa %xmm1,%xmm9 + movaps %xmm1,%xmm7 + +# +# compute the index into the log tables +# + movdqa %xmm0,%xmm3 + movaps %xmm0,%xmm1 + psrld $23,%xmm3 + + # + # compute the index into the log tables + # + psrld $23,%xmm9 + subps .L__real_one(%rip),%xmm7 + psubd .L__mask_127(%rip),%xmm9 + subps .L__real_one(%rip),%xmm1 + psubd .L__mask_127(%rip),%xmm3 + cvtdq2ps %xmm9,%xmm13 # xexp + + movdqa %xmm12,%xmm9 + pand .L__real_mant(%rip),%xmm9 + xor %r8,%r8 + movdqa %xmm9,%xmm8 + movaps .L__real_half(%rip),%xmm11 # .5 + cvtdq2ps %xmm3,%xmm6 # xexp + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + xor %r8,%r8 + movdqa %xmm3,%xmm2 + movaps .L__real_half(%rip),%xmm5 # .5 + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $16,%xmm9 + movdqa %xmm9,%xmm10 + psrld $1,%xmm9 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + #/* Now x = 2**xexp * f, 1/2 <= f < 1.
*/ + paddd .L__mask_040(%rip),%xmm9 + pand .L__mask_001(%rip),%xmm10 + paddd %xmm10,%xmm9 + cvtdq2ps %xmm9,%xmm7 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + packssdw %xmm9,%xmm9 + movq %xmm9,p_idx2(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + # reduce and get u + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + movdqa %xmm12,%xmm9 + orps .L__real_half(%rip),%xmm8 + + + mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128 + subps %xmm7,%xmm8 # f2 = f - f1 + mulps %xmm8,%xmm11 + addps %xmm11,%xmm7 + + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + divps %xmm7,%xmm8 # u + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12(%rsp) # save the f1 values + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12+8(%rsp) # save the f1 value +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + + lea .L__np_ln_tail_table(%rip),%rdx + addps %xmm4,%xmm1 # poly + +# recombine + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f2 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f2 value + + addps p_q(%rsp),%xmm1 #z2 +=q + + movaps p_z1(%rsp),%xmm0 # z1 values + + mulps %xmm6,%xmm2 + addps %xmm2,%xmm0 #r1 + mulps .L__real_log2_tail(%rip),%xmm6 + addps %xmm6,%xmm1 #r2 + addps %xmm1,%xmm0 + + + +# check for e + test $0x0f,%r9d + jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm1,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg + +.L__f2: +## if +inf + movaps p_x(%rsp),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__log_inf +.L__f3: + + movaps p_x(%rsp),%xmm3 + subps .L__real_one(%rip),%xmm3 + andps .L__real_notsign(%rip),%xmm3 + cmpps $2,.L__real_threshold(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__near_one +.L__f4: + +# finish the second set of calculations + + # solve for ln(1+u) + movaps %xmm8,%xmm7 # u + mulps %xmm8,%xmm8 # u^2 + movaps %xmm8,%xmm11 + + movaps .L__real_cb3(%rip),%xmm9 + mulps %xmm8,%xmm9 #Cu2 + mulps %xmm7,%xmm11 # u^3 + addps .L__real_cb2(%rip),%xmm9 #B+Cu2 + movaps %xmm8,%xmm10 + mulps %xmm11,%xmm10 # u^5 + movaps .L__real_log2_lead(%rip),%xmm8 + + mulps .L__real_cb1(%rip),%xmm11 #Au3 + addps %xmm11,%xmm7 # u+Au3 + mulps %xmm9,%xmm10 # u5(B+Cu2) + addps %xmm10,%xmm7 # poly + + + # recombine + lea .L__np_ln_tail_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2(%rsp) # save the f2 value + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2+8(%rsp) # save the f2 value + addps p_q2(%rsp),%xmm7 #z2 +=q + movaps p_z12(%rsp),%xmm1 # z1 values + + mulps %xmm13,%xmm8 + addps %xmm8,%xmm1 #r1 + mulps .L__real_log2_tail(%rip),%xmm13 + addps %xmm13,%xmm7 #r2 + addps %xmm7,%xmm1 + + # check e as a special case + movaps p_x2(%rsp),%xmm10 + cmpps $0,.L__real_ef(%rip),%xmm10 + movmskps %xmm10,%r9d + # check for e + test $0x0f,%r9d + jnz .L__vlogf_e2 +.L__f12: + + # check for negative numbers or zero + xorps %xmm7,%xmm7 + cmpps $1,p_x2(%rsp),%xmm7 # 0 greater than =?. catches NaNs also. 
+ movmskps %xmm7,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg2 + +.L__f22: + ## if +inf + movaps p_x2(%rsp),%xmm9 + cmpps $0,.L__real_inf(%rip),%xmm9 + movmskps %xmm9,%r9d + test $0x0f,%r9d + jnz .L__log_inf2 +.L__f32: + + movaps p_x2(%rsp),%xmm9 + subps .L__real_one(%rip),%xmm9 + andps .L__real_notsign(%rip),%xmm9 + cmpps $2,.L__real_threshold(%rip),%xmm9 + movmskps %xmm9,%r9d + test $0x0f,%r9d + jnz .L__near_one2 +.L__f42: + + +.L__finish: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + +.L__vlogf_e: + movdqa p_x(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm0,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + jmp .L__f1 + +.L__vlogf_e2: + movdqa p_x2(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm1,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + jmp .L__f12 + + .align 16 +.L__near_one: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm3,p_omask(%rsp) # save ones mask + movaps p_x(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r +# u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm1 + divps %xmm2,%xmm1 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm1,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction +# u = u + u; + addps %xmm1,%xmm1 #u + movaps %xmm1,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm1,%xmm5 # Cu + movaps %xmm1,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm1 + mulps %xmm1,%xmm1 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm1 #u6(Cu+Du3) + addps %xmm1,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + +# return r + r2; + addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm0,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + + jmp .L__f4 + + + .align 16 +.L__near_one2: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm9,p_omask(%rsp) # save ones mask + movaps p_x2(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r + # u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm7 + divps %xmm2,%xmm7 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C + # correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm7,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction + # u = u + u; + addps %xmm7,%xmm7 #u + movaps %xmm7,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 + # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm7,%xmm5 # Cu + movaps %xmm7,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm7 + mulps %xmm7,%xmm7 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm7 #u6(Cu+Du3) + addps %xmm7,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + + # return r + r2; + addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm1,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 
# setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + + jmp .L__f42 + +# we have a zero, a negative number, or both. +# the mask is already in %xmm1. NaNs are also picked up here, along with -inf. +.L__z_or_neg: +# deal with negatives first + movdqa %xmm1,%xmm3 + andps %xmm0,%xmm3 # keep the non-error values + andnps .L__real_nan(%rip),%xmm1 # setup the nan values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace +# check for +/- 0 + xorps %xmm1,%xmm1 + cmpps $0,p_x(%rsp),%xmm1 # 0 ?. + movmskps %xmm1,%r9d + test $0x0f,%r9d + jz .L__zn2 + + movdqa %xmm1,%xmm3 + andnps %xmm0,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + +.L__zn2: +# check for NaNs + movaps p_x(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x(%rsp),%xmm1 # isolate the NaNs + pand %xmm4,%xmm1 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm1,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm1 + andnps %xmm0,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f2 + +# handle only +inf log(+inf) = inf +.L__log_inf: + movdqa %xmm3,%xmm1 + andnps %xmm0,%xmm3 # keep the non-error values + andps p_x(%rsp),%xmm1 # setup the +inf values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + jmp .L__f3 + + +.L__z_or_neg2: + # deal with negatives first + movdqa %xmm7,%xmm3 + andps %xmm1,%xmm3 # keep the non-error values + andnps .L__real_nan(%rip),%xmm7 # setup the nan values + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + # check for +/- 0 + xorps %xmm7,%xmm7 + cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d + test $0x0f,%r9d + jz .L__zn22 + + movdqa %xmm7,%xmm3 + andnps %xmm1,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + +.L__zn22: + # check for NaNs + movaps p_x2(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x2(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x2(%rsp),%xmm7 # isolate the NaNs + pand %xmm4,%xmm7 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm7,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm7 + andnps %xmm1,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f22 + + # handle only +inf log(+inf) = inf +.L__log_inf2: + movdqa %xmm9,%xmm7 + andnps %xmm1,%xmm9 # keep the non-error values + andps p_x2(%rsp),%xmm7 # setup the +inf values + orps %xmm9,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + jmp .L__f32 + + + .data + .align 64 + +.L__real_zero: .quad 0x00000000000000000 # 0.0 + .quad 0x00000000000000000 +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 2.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits + .quad 0x0007FFFFF007FFFFF +.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */ + .quad 0x03c0000003c000000 +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f +.L__mask_040: .quad 0x00000004000000040 # + .quad 0x00000004000000040 +.L__mask_001: .quad 0x00000000100000001 # + .quad 0x00000000100000001 + + +.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03 + .quad 0x03CF5C28F3CF5C28F + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03 + .quad 0x03B1249183B124918 +.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04 + .quad 0x039E401A639E401A6 +.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03 + .quad 0x03B124A123B124A12 +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 + + +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 #
1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01 + .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 
+ .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + .long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + .long 0x387BE3CD # 0.000060055219 16 + .long 0x3803B715 # 0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + 
.long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment + +
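#
# Lane by lane, the special-case tails of __vrs8_logf above reduce to the
# usual C99 logf edge cases. In scalar form (illustrative C; the helper name
# is ours, and 2.7182817f is the float e that .L__real_ef compares against):
#
#   #include <math.h>
#   static float logf_edges(float x, float mainpath)
#   {
#       float y = mainpath;                     /* table-driven result      */
#       if (x == 2.7182817f) y = 1.0f;          /* .L__vlogf_e: logf(e)==1  */
#       if (x < 0.0f)        y = NAN;           /* .L__z_or_neg             */
#       if (x == 0.0f)       y = -INFINITY;     /* C99 specs -inf for +-0   */
#       if (isnan(x))        y = x + x;         /* .L__zn2: SNaN -> QNaN    */
#       if (isinf(x) && x > 0.0f) y = INFINITY; /* .L__log_inf              */
#       return y;
#   }
#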
diff --git a/src/gas/vrsacosf.S b/src/gas/vrsacosf.S new file mode 100644 index 0000000..1620009 --- /dev/null +++ b/src/gas/vrsacosf.S
@@ -0,0 +1,2291 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrsacosf.s +# +# A vector implementation of the cosf libm function. +# +# Prototype: +# +# vrsa_cosf(int n, float* x, float* y); +# +# Computes the cosine of x for an array of input values. +# Places the results into the supplied y array. +# Does not perform error checking. +# Denormal inputs may produce unexpected results. +# This inlines a routine that computes 4 single-precision cosine values at a time. +# The four values are passed as packed singles in xmm10. +# The four results are returned as packed singles in xmm10. +# Note that this represents a non-standard ABI usage, as no ABI +# (and indeed C) currently allows returning 2 values for a function. +# It is expected that some compilers may be able to take advantage of this +# interface when implementing vectorized loops. Using the array implementation +# of the routine requires putting the inputs into memory and retrieving +# the results from memory. The register interface eliminates this +# overhead if the data does not already reside in memory.
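#
# Schematically, the array driver below is (illustrative C; vrs4_cosf_kernel
# stands in for the inlined 4-wide body and is not a symbol in this file):
#
#   void vrsa_cosf(int n, float *x, float *y)
#   {
#       int iter = n >> 2;                   /* p_iter: 4-wide iterations */
#       int left = n - (iter << 2);          /* save_nv: leftover values  */
#       for (int i = 0; i < iter; ++i)       /* .L__vrsa_top main loop    */
#           vrs4_cosf_kernel(&x[4*i], &y[4*i]);
#       for (int i = n - left; i < n; ++i)   /* .L__vrsa_cleanup          */
#           y[i] = cosf(x[i]);
#   }
#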
+# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 64 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03FA5555555502F31 + .quad 0x0BF56C16BF55699D7 # -0.00138889 c2 + .quad 0x0BF56C16BF55699D7 + .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3 + .quad 0x03EFA015C50A93B49 + .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4 + .quad 0x0BE92524743CC46B8 + +.Lsinarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BFC555555545E87D + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x03F811110DF01232D + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x0BF2A013A88A37196 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x03EC6DBE4AD1572D5 + +.Lsincosarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x0BF56C16BF55699D7 + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x03EFA015C50A93B49 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x0BE92524743CC46B8 + +.Lcossinarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BF56C16BF55699D7 # c2 + .quad 0x03F811110DF01232D + .quad 0x03EFA015C50A93B49 # c3 + .quad 0x0BF2A013A88A37196 + .quad 0x0BE92524743CC46B8 # c4 + .quad 0x03EC6DBE4AD1572D5 + + +.align 8 + .Levencos_oddsin_tbl: + .quad .Lcoscos_coscos_piby4 # 0 * ; Done + .quad .Lcoscos_cossin_piby4 # 1 + ; Done + .quad .Lcoscos_sincos_piby4 # 2 ; Done + .quad .Lcoscos_sinsin_piby4 # 3 + ; Done + + .quad .Lcossin_coscos_piby4 # 4 ; Done + .quad .Lcossin_cossin_piby4 # 5 * ; Done + .quad .Lcossin_sincos_piby4 # 6 ; Done + .quad 
.Lcossin_sinsin_piby4 # 7 ; Done + + .quad .Lsincos_coscos_piby4 # 8 ; Done + .quad .Lsincos_cossin_piby4 # 9 ; TBD + .quad .Lsincos_sincos_piby4 # 10 * ; Done + .quad .Lsincos_sinsin_piby4 # 11 ; Done + + .quad .Lsinsin_coscos_piby4 # 12 ; Done + .quad .Lsinsin_cossin_piby4 # 13 + ; Done + .quad .Lsinsin_sincos_piby4 # 14 ; Done + .quad .Lsinsin_sinsin_piby4 # 15 * ; Done + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + .weak vrsa_cosf_ + .set vrsa_cosf_,__vrsa_cosf__ + .weak vrsa_cosf__ + .set vrsa_cosf__,__vrsa_cosf__ + + .text + .align 16 + .p2align 4,,15 + +#FORTRAN subroutine implementation of array cosf +#VRSA_COSF(N,X,Y) +#C equivalent +#void vrsa_cosf__(int * n, float *x, float *y) +#{ +# vrsa_cosf(*n,x,y); +#} + +.globl __vrsa_cosf__ + .type __vrsa_cosf__,@function +__vrsa_cosf__: + mov (%rdi),%edi + + .align 16 + .p2align 4,,15 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# define local variable storage offsets +.equ p_temp,0 # temporary for get/put bits operation +.equ p_temp1,0x10 # temporary for get/put bits operation + +.equ save_xmm6,0x20 # xmm save area +.equ save_xmm7,0x30 # xmm save area +.equ save_xmm8,0x40 # xmm save area +.equ save_xmm9,0x50 # xmm save area +.equ save_xmm0,0x60 # xmm save area +.equ save_xmm11,0x70 # xmm save area +.equ save_xmm12,0x80 # xmm save area +.equ save_xmm13,0x90 # xmm save area +.equ save_xmm14,0x0A0 # xmm save area +.equ save_xmm15,0x0B0 # xmm save area + +.equ r,0x0C0 # r for remainder_piby2 +.equ rr,0x0D0 # rr for remainder_piby2 +.equ region,0x0E0 # region for remainder_piby2 + +.equ r1,0x0F0 # r for remainder_piby2 +.equ rr1,0x0100 # rr for remainder_piby2 +.equ region1,0x0110 # region for remainder_piby2 + +.equ p_temp2,0x0120 # temporary for get/put bits operation +.equ p_temp3,0x0130 # temporary for get/put bits operation + +.equ p_temp4,0x0140 # temporary for get/put bits operation +.equ p_temp5,0x0150 # temporary for get/put bits operation + +.equ p_original,0x0160 # original x +.equ p_mask,0x0170 # mask +.equ p_sign,0x0180 # sign + +.equ p_original1,0x0190 # original x +.equ p_mask1,0x01A0 # mask +.equ p_sign1,0x01B0 # sign + +.equ save_r12,0x01C0 # save r12 +.equ save_r13,0x01D0 # save r13 + +.equ save_xa,0x01E0 #qword +.equ save_ya,0x01F0 #qword + +.equ save_nv,0x0200 #qword +.equ p_iter,0x0210 # qword storage for number of loop iterations + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +.globl vrsa_cosf + .type vrsa_cosf,@function +vrsa_cosf: + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# parameters are passed in by Linux as: +# rdi - int n +# rsi - float *x +# rdx - float *y + + + sub $0x0228,%rsp + mov %r12,save_r12(%rsp) # save r12 + mov %r13,save_r13(%rsp) # save r13 + + + + + +#START PROCESS INPUT +# save the arguments + mov %rsi,save_xa(%rsp) # save x_array pointer + mov %rdx,save_ya(%rsp) # save y_array pointer +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax + mov %rax,%rdi +#endif + mov %rdi,save_nv(%rsp) # save number of values +# see if too few values to call the main loop + shr $2,%rax
# get number of iterations + jz .L__vrsa_cleanup # jump if only single calls +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rdi # compute number of extra single calls + mov %rdi,save_nv(%rsp) # save number of left over values + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#START LOOP +.align 16 +.L__vrsa_top: +# build the input _m128d + mov save_xa(%rsp),%rsi # get x_array pointer + movlps (%rsi),%xmm0 + movhps 8(%rsi),%xmm0 + + prefetch 32(%rsi) + add $16,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# V4 START + + movhlps %xmm0,%xmm8 + cvtps2pd %xmm0,%xmm10 # convert input to double. + cvtps2pd %xmm8,%xmm1 # convert input to double. + +movdqa %xmm10,%xmm6 +movdqa %xmm1,%xmm7 +movapd .L__real_7fffffffffffffff(%rip),%xmm2 + +andpd %xmm2,%xmm10 #Unsign +andpd %xmm2,%xmm1 #Unsign + +movd %xmm10,%rax #rax is lower arg +movhpd %xmm10, p_temp+8(%rsp) # +mov p_temp+8(%rsp),%rcx #rcx = upper arg + +movd %xmm1,%r8 #r8 is lower arg +movhpd %xmm1, p_temp1+8(%rsp) # +mov p_temp1+8(%rsp),%r9 #r9 = upper arg + +movdqa %xmm10,%xmm12 +movdqa %xmm1,%xmm13 + +pcmpgtd %xmm6,%xmm12 +pcmpgtd %xmm7,%xmm13 +movdqa %xmm12,%xmm6 +movdqa %xmm13,%xmm7 +psrldq $4,%xmm12 +psrldq $4,%xmm13 +psrldq $8,%xmm6 +psrldq $8,%xmm7 + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + + +por %xmm6,%xmm12 +por %xmm7,%xmm13 + +movapd %xmm10,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm10,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm10 + mulpd %xmm10,%xmm2 # * twobypi + mulpd %xmm10,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. 
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + mov %r8,%r10 + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + + mov %r10,%rax + mov %r11,%rcx + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + xor %rax,%r10 + xor %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm10,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm10 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 # move r for r2 + movapd %xmm1,%xmm3 # move r for r2 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + leaq .Levencos_oddsin_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 
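#
# The reduction just performed is the three-constant Cody-Waite scheme the
# comments quote; gathered into scalar C for reference (names match the
# .data constants above; computed in double precision, as in the code):
#
#   int    npi2  = (int)(x*twobypi + 0.5);    /* nearest multiple of pi/2  */
#   double rhead = x - npi2*piby2_1;          /* piby2_1 has a short       */
#   double rtail = npi2*piby2_2;              /* mantissa, so this is exact */
#   double t     = rhead;
#   rhead = t - rtail;
#   rtail = npi2*piby2_2tail - ((t - rhead) - rtail);
#   double r      = rhead - rtail;            /* reduced argument          */
#   int    region = npi2 & 3;                 /* picks +/-cos or +/-sin    */
#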
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm0, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm10 + subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail) + subsd %xmm10,%xmm6 # rr=rhead-r + subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm10,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_lower_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_cosf_lower_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + jmp 
.Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# added ins- changed input from xmm10 to xmm0 + movd %xmm10,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrs4_cosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrs4_cosf_upper_naninf_of_both_gt_5e5: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Restore xmm4 and %xmm3,,%xmm1 xmm7 +# Can use %xmm0,,%xmm8 xmm12 +# %xmm9,,%xmm5 xmm11, xmm13 + + movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * 
piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+
+ subsd %xmm0,%xmm6 # xmm6 = r=(rhead-rtail)
+
+ movlpd %xmm6,r(%rsp) # store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5
+
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
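+
+# Both the scalar sequences above and the packed sequence that follows
+# compute the reduced argument with a three-part pi/2, Cody-Waite style.
+# A minimal C sketch (the decimal constants are the values the
+# .L__real_3ff921fb54400000/3dd0b4611a600000/3ba3198a2e037073 bit patterns
+# encode; for |x| >= 5e5 the code calls __remainder_piby2d2f instead):
+#
+#   static double reduce_piby2(double x, int npi2)
+#   {
+#       const double piby2_1     = 1.57079632673412561417e+00;
+#       const double piby2_2     = 6.07710050630396597660e-11;
+#       const double piby2_2tail = 2.02226624879595063154e-21;
+#       double rhead = x - npi2 * piby2_1;
+#       double rtail = npi2 * piby2_2;
+#       double t = rhead;
+#       rhead = t - rtail;
+#       rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#       return rhead - rtail;        /* r, accurate past double rounding */
+#   }
+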
+ +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm5,region1(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + + jmp .L__vrs4_cosf_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm0, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the 
multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + + subsd %xmm10,%xmm7 # xmm1 = r=(rhead-rtail) + + movlpd %xmm7,r1+8(%rsp) # store upper r + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_cosf_lower_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_cosf_reconstruct + + + + + + + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + + movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm1,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp1(%rsp),%r9 #Restore upper arg + + jmp 0f + +.L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + + jmp .L__vrs4_cosf_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd 
%xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# Work on next two args, third arg < 5e5, fourth arg >= 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_fourth_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r8d,region1(%rsp) # store lower region + + subsd %xmm10,%xmm7 # xmm10 = r=(rhead-rtail) + + movlpd %xmm7,r1(%rsp) # store upper r + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_cosf_upper_naninf_higher + + lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_cosf_upper_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_cosf_reconstruct + + 
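+
+# All of the >= 5e5 paths above share the same NaN/infinity screen before
+# calling __remainder_piby2d2f: a double is NaN or Inf exactly when all
+# eleven exponent bits are set, and exceptional inputs are turned into a
+# quiet NaN by setting bit 51 (cos(+/-inf) is NaN under C99, so infinities
+# get the same treatment).  A C sketch of those bit tests:
+#
+#   #include <stdint.h>
+#
+#   static int is_nan_or_inf(uint64_t bits)
+#   {
+#       const uint64_t expmask = 0x7ff0000000000000ULL;
+#       return (bits & expmask) == expmask;
+#   }
+#
+#   static uint64_t quiet_nan_of(uint64_t bits)
+#   {
+#       return bits | 0x0008000000000000ULL;   /* set the quiet bit */
+#   }
+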
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_cosf_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd r(%rsp),%xmm10 + movapd r1(%rsp),%xmm1 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + + mov %r8,%r10 + mov %r9,%r11 + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + mov %r10,%rax + mov %r11,%rcx + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + xor %rax,%r10 + xor %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + leaq .Levencos_oddsin_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + + + + + + + + + + + + + + + + + + + + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrsa_cosf_cleanup: + + movapd p_sign(%rsp),%xmm10 + movapd p_sign1(%rsp),%xmm1 + xorpd %xmm4,%xmm10 # (+) Sign + xorpd %xmm5,%xmm1 # (+) Sign + + cvtpd2ps %xmm10,%xmm0 + cvtpd2ps %xmm1,%xmm11 + movlhps %xmm11,%xmm0 + +# NEW + +.L__vrsa_bottom1: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlps %xmm0,(%rdi) + movhps %xmm0,8(%rdi) + + prefetch 32(%rdi) + add $16,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vrsa_top + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vrsa_cleanup + +.L__final_check: + +# NEW + + mov save_r12(%rsp),%r12 # restore r12 + mov save_r13(%rsp),%r13 # restore r13 + + add $0x0228,%rsp + ret + +#NEW + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# we jump here when we have an odd number of cos calls to make at the end +# we assume that rdx is pointing at the next x array element, r8 at the next y array element. 
+# The number of values left is in save_nv + +.align 16 +.L__vrsa_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__final_check # exit if not + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + + +# START WORKING FROM HERE +# fill in a m128d with zeroes and the extra values and then make a recursive call. + xorps %xmm0,%xmm0 + movss %xmm0,p_temp+4(%rsp) + movlps %xmm0,p_temp+8(%rsp) + + + mov (%rsi),%ecx # we know there's at least one + mov %ecx,p_temp(%rsp) + cmp $2,%rax + jl .L__vrsacg + + mov 4(%rsi),%ecx # do the second value + mov %ecx,p_temp+4(%rsp) + cmp $3,%rax + jl .L__vrsacg + + mov 8(%rsi),%ecx # do the third value + mov %ecx,p_temp+8(%rsp) + +.L__vrsacg: + mov $4,%rdi # parameter for N + lea p_temp(%rsp),%rsi # &x parameter + lea p_temp2(%rsp),%rdx # &y parameter + call vrsa_cosf@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + + mov p_temp2(%rsp),%ecx + mov %ecx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vrsacgf + + mov p_temp2+4(%rsp),%ecx + mov %ecx,4(%rdi) # do the second value + cmp $3,%rax + jl .L__vrsacgf + + mov p_temp2+8(%rsp),%ecx + mov %ecx,8(%rdi) # do the third value + +.L__vrsacgf: + jmp .L__final_check + +#NEW + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_coscos_piby4: + movapd %xmm2,%xmm0 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd 
.Lcosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm2,%xmm4 # x4(c3+x2c4) + mulpd %xmm3,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm0,%xmm4 # + t + subpd %xmm11,%xmm5 # + t + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lcossin_cossin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # s2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3 + addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3 + addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1 + addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm0,%xmm0 # move high x4 for cos term + movhlps %xmm11,%xmm11 # move high x4 for cos term + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd %xmm10,%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos + movhlps %xmm3,%xmm13 # move high r for cos + + movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos + movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm7,%xmm5 # sin *x3 + + mulsd %xmm0,%xmm8 # cos *x4 + mulsd %xmm11,%xmm9 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0 + + addsd %xmm10,%xmm4 # sin + x + addsd %xmm1,%xmm5 # sin + x + subsd %xmm12,%xmm8 # cos+t + subsd %xmm13,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + + jmp .L__vrsa_cosf_cleanup + +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lsincos_cossin_piby4: + + movapd .Lsincosarray+0x30(%rip),%xmm4 # s4 + movapd .Lcossinarray+0x30(%rip),%xmm5 # s4 + movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2 + movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm3,%xmm7 # sincos term upper x2 for x3 + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4 + addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4 + addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2 + addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm0,%xmm0 # move 
high x4 for cos term + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin) + mulpd %xmm1,%xmm7 + + mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin) + movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + + movhlps %xmm2,%xmm12 # move high r for cos (cossin) + + + movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin) + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos) + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm11,%xmm5 # cos *x4 + mulsd %xmm0,%xmm8 # cos *x4 + mulsd %xmm7,%xmm9 # sin *x3 + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos) + + addsd %xmm10,%xmm4 # sin + x + + addsd %xmm11,%xmm9 # sin + x + + + subsd %xmm12,%xmm8 # cos+t + subsd %xmm3,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lsincos_sincos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd .Lcossinarray+0x30(%rip),%xmm4 # s4 + movapd .Lcossinarray+0x30(%rip),%xmm5 # s4 + movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2 + movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm2,%xmm6 # move x2 for x4 + movapd %xmm3,%xmm7 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3 + addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3 + addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1 + addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1 + + mulpd %xmm0,%xmm4 # x4(s4+x2s3) + mulpd %xmm11,%xmm5 # x4(s4+x2s3) + + mulpd %xmm10,%xmm6 # get low x3 for sin term + mulpd %xmm1,%xmm7 # get low x3 for sin term + movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term + movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term + + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms + mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm4,%xmm12 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm13 # xmm9 = sin , xmm5 = cos + + mulsd %xmm6,%xmm12 # sin *x3 + mulsd %xmm7,%xmm13 # sin *x3 + mulsd %xmm0,%xmm4 # cos *x4 + mulsd %xmm11,%xmm5 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0 + + movhlps %xmm10,%xmm0 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x for sin term + # Reverse 10 and 0 + + addsd %xmm0,%xmm12 # sin + x + addsd %xmm11,%xmm13 # sin + x + + subsd %xmm2,%xmm4 # cos+t + subsd %xmm3,%xmm5 # cos+t + + movlhps %xmm12,%xmm4 + movlhps %xmm13,%xmm5 + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lcossin_sincos_piby4: + + movapd .Lcossinarray+0x30(%rip),%xmm4 # s4 + movapd .Lsincosarray+0x30(%rip),%xmm5 # s4 + movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2 + movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # 
move x2 for x4 + movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos) + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4 + addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4 + addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2 + addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm11,%xmm11 # move high x4 for cos term + + movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin) + mulpd %xmm10,%xmm7 + + mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin) + movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos) + + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm3,%xmm12 # move high r for cos (cossin) + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos) + movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin) + + mulsd %xmm0,%xmm4 # cos *x4 + mulsd %xmm6,%xmm5 # sin *x3 + mulsd %xmm7,%xmm8 # sin *x3 + mulsd %xmm11,%xmm9 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + + movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos) + + subsd %xmm2,%xmm4 # cos-(-t) + subsd %xmm12,%xmm9 # cos-(-t) + + addsd %xmm11,%xmm8 # sin + x + addsd %xmm1,%xmm5 # sin + x + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrsa_cosf_cleanup +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: SIN +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: COS +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm0 # x2 ; SIN + movapd %xmm3,%xmm11 # x2 ; COS + movapd %xmm3,%xmm1 # copy of x2 for x4 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm0 # x4 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm3,%xmm1 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + + mulpd %xmm0,%xmm4 # x4(c3+x2c4) + mulpd %xmm1,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm1,%xmm5 # x4 * zc + + addpd %xmm10,%xmm4 # +x + subpd %xmm11,%xmm5 # +t + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lsinsin_coscos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: COS +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: SIN +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm0 # x2 ; COS + movapd %xmm3,%xmm11 # x2 ; SIN + movapd %xmm2,%xmm10 # copy of x2 for x4 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # s4 
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # s2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # s4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # s2*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4 + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # s1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm10,%xmm4 # x4(c3+x2c4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zs + + mulpd %xmm10,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x3 * zc + + subpd %xmm0,%xmm4 # +t + addpd %xmm1,%xmm5 # +x + + jmp .L__vrsa_cosf_cleanup +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_cossin_piby4: #Derive from cossin_coscos + movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos + movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin + movapd %xmm3,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcosarray(%rip),%xmm9 # c2+x2c1 + + movapd %xmm12,%xmm2 # upper=x4 + movsd %xmm6,%xmm2 # lower=x2 + mulsd %xmm10,%xmm2 # lower=x2*x + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # upper= x4 * zc + # lower=x3 * zs + mulpd %xmm13,%xmm5 # x4 * zc + + movlhps %xmm7,%xmm10 # + addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos + subpd %xmm11,%xmm5 # -(-t) + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lcoscos_sincos_piby4: #Derive from cossin_coscos + movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos + movapd %xmm3,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcosarray(%rip),%xmm9 # c2+x2c1 + + mulpd %xmm10,%xmm2 # upper=x3 for sin + mulsd %xmm10,%xmm2 # lower=x4 for cos + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + 
mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # lower= x4 * zc + # upper= x3 * zs + mulpd %xmm13,%xmm5 # x4 * zc + + movsd %xmm7,%xmm10 + addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos + subpd %xmm11,%xmm5 # -(-t) + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lcossin_coscos_piby4: + movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos + movapd %xmm2,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd %xmm3,%xmm6 # lower x2 for x3 for sin + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4 + movapd .Lcosarray+0x10(%rip),%xmm8 # cs2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1 + + movapd %xmm13,%xmm3 # upper=x4 + movsd %xmm6,%xmm3 # lower x2 + mulsd %xmm1,%xmm3 # lower x2*x + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm12,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # upper= x4 * zc + # lower=x3 * zs + + movlhps %xmm7,%xmm1 + addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos + subpd %xmm11,%xmm4 # -(-t) + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lcossin_sinsin_piby4: # Derived from sincos_coscos + + movhlps %xmm3,%xmm0 # x2 + movapd %xmm3,%xmm7 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + movapd %xmm13,%xmm3 # upper x4 for cos + movsd %xmm7,%xmm3 # lower x2 for sin + mulsd %xmm1,%xmm3 # lower x3=x2*x for sin + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # upper=x4 * zc + # lower=x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm1,%xmm5 # +t upper, +x lower + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lsincos_coscos_piby4: + movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos + movapd %xmm2,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4 + movapd 
.Lcosarray+0x10(%rip),%xmm8 # cs2 + movapd .Lcossinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1 + + mulpd %xmm1,%xmm3 # upper=x3 for sin + mulsd %xmm1,%xmm3 # lower=x4 for cos + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm12,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # lower= x4 * zc + # upper= x3 * zs + + movsd %xmm7,%xmm1 + subpd %xmm11,%xmm4 # -(-t) + addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos + + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lsincos_sinsin_piby4: # Derived from sincos_coscos + + movsd %xmm3,%xmm0 # x2 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lcossinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm1,%xmm6 + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm6,%xmm11 # upper =t ; lower =x + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm11,%xmm5 # +t lower, +x upper + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lsinsin_cossin_piby4: # Derived from sincos_coscos + + movhlps %xmm2,%xmm0 # x2 + movapd %xmm2,%xmm7 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + movapd %xmm12,%xmm2 # upper x4 for cos + movsd %xmm7,%xmm2 # lower x2 for sin + mulsd %xmm10,%xmm2 # lower x3=x2*x for sin + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm11,%xmm10 # t for upper 
cos and x for lower sin + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # upper=x4 * zc + # lower=x3 * zs + + addpd %xmm1,%xmm5 # +x + addpd %xmm10,%xmm4 # +t upper, +x lower + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lsinsin_sincos_piby4: # Derived from sincos_coscos + + movsd %xmm2,%xmm0 # x2 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + movapd .Lcossinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm10,%xmm2 # upper x3 for sin + mulsd %xmm10,%xmm2 # lower x4 for cos + + movhlps %xmm10,%xmm6 + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm6,%xmm11 + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + addpd %xmm1,%xmm5 # +x + addpd %xmm11,%xmm4 # +t lower, +x upper + + jmp .L__vrsa_cosf_cleanup + +.align 16 +.Lsinsin_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + #x2 = x * x; + #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)))); + + #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4)); + + + movapd %xmm2,%xmm0 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + + mulpd %xmm10,%xmm2 # x3 + mulpd %xmm1,%xmm3 # x3 + + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm0,%xmm4 # x4(c3+x2c4) + mulpd %xmm11,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + + jmp .L__vrsa_cosf_cleanup
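+
+# Every jump-table case above is some pairing of the same two scalar
+# kernels, split across the lanes.  A C model of those kernels (the
+# coefficient arrays mirror .Lsinarray/.Lcosarray; the actual minimax
+# values live in the data section and are not repeated here):
+#
+#   static double sin_piby4(double x, const double s[4])
+#   {
+#       double x2 = x * x, x4 = x2 * x2;
+#       double zs = (s[0] + x2 * s[1]) + x4 * (s[2] + x2 * s[3]);
+#       return x + (x2 * x) * zs;               /* x + x^3 * zs */
+#   }
+#
+#   static double cos_piby4(double x, const double c[4])
+#   {
+#       double x2 = x * x, x4 = x2 * x2;
+#       double zc = (c[0] + x2 * c[1]) + x4 * (c[2] + x2 * c[3]);
+#       double t = 1.0 - 0.5 * x2;              /* asm keeps -t = 0.5*x2 - 1 */
+#       return t + x4 * zc;
+#   }
+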
diff --git a/src/gas/vrsaexpf.S b/src/gas/vrsaexpf.S new file mode 100644 index 0000000..399943e --- /dev/null +++ b/src/gas/vrsaexpf.S
@@ -0,0 +1,766 @@ +
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv.  If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsaexpf.s
+#
+# An array implementation of the expf libm function.
+#
+# Prototype:
+#
+#    void vrsa_expf(int n, float *x, float *y);
+#
+# Computes e raised to the x power for an array of input values.
+# Places the results into the supplied y array.
+# This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it is
+# better suited to vectorization.
+# Does not perform error handling, but does return C99 values for error
+# inputs.  Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine.  The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error.  The resulting code represents
+# the best time of many variations.  It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved.  But moving any of this code forward makes the routine
+# slower.
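+
+# A compact C model of the algorithm the step comments below describe,
+# exp(x) = 2^m * 2^(j/32) * exp(r); exp2/ldexp stand in for the
+# .L__two_to_jby32_table lookup and the exponent build, and the input
+# clamping and NaN/Inf handling of the real code are omitted:
+#
+#   #include <math.h>
+#
+#   static float expf_model(float x)
+#   {
+#       int n = (int)nearbyint(x * 32.0 / M_LN2);  /* round(x * 32/ln2) */
+#       int j = n & 0x1f;                          /* table index */
+#       int m = (n - j) >> 5;                      /* power-of-two scale */
+#       double r = x - n * (M_LN2 / 32.0);         /* asm: head+tail split */
+#       double q = r + r * r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0)));
+#       double f1 = exp2(j / 32.0);                /* 2^(j/32) */
+#       return (float)ldexp(f1 + f1 * q, m);       /* 2^m * (f1 + f1*q) */
+#   }
+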
+# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + +# define local variable storage offsets +.equ p_ux,0x00 #qword +.equ p_ux2,0x010 #qword + +.equ save_xa,0x020 #qword +.equ save_ya,0x028 #qword +.equ save_nv,0x030 #qword + + +.equ p_iter,0x038 # qword storage for number of loop iterations + +.equ p_j,0x040 # second temporary for get/put bits operation +.equ p_m,0x050 #qword +.equ p_j2,0x060 # second temporary for exponent multiply +.equ p_m2,0x070 #qword +.equ save_rbx,0x080 #qword + + +.equ stack_size,0x098 + + .weak vrsa_expf_ + .set vrsa_expf_,__vrsa_expf__ + .weak vrsa_expf__ + .set vrsa_expf__,__vrsa_expf__ + +# parameters are passed in by gcc as: +# rdi - int n +# rsi - double *x +# rdx - double *y + + .text + .align 16 + .p2align 4,,15 + +#/* a FORTRAN subroutine implementation of array expf +#** VRSA_EXPF(N,X,Y) +# C equivalent*/ +#void vrsa_expf__(int * n, float *x, float *y) +#{ +# vrsa_expf(*n,x,y); +#} +.globl __vrsa_expf__ + .type __vrsa_expf__,@function +__vrsa_expf__: + mov (%rdi),%edi + + .align 16 + .p2align 4,,15 +.globl vrsa_expf + .type vrsa_expf,@function +vrsa_expf: + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) + +# save the arguments + mov %rsi,save_xa(%rsp) # save x_array pointer + mov %rdx,save_ya(%rsp) # save y_array pointer +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax + mov %rax,%rdi +#endif + mov %rdi,save_nv(%rsp) # save number of values + +# see if too few values to call the main loop + shr $3,%rax # get number of iterations + jz .L__vsa_cleanup # jump if only single calls +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $3,%rax + sub %rax,%rdi # compute number of extra single calls + mov %rdi,save_nv(%rsp) # save number of left over values + +# In this second version, process the array 8 values at a time. + +.L__vsa_top: +# build the input _m128 + movaps .L__real_thirtytwo_by_log2(%rip),%xmm3 # + mov save_xa(%rsp),%rsi # get x_array pointer + movups (%rsi),%xmm0 + movups 16(%rsi),%xmm6 + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + movaps %xmm0,p_ux(%rsp) + maxps .L__real_m8192(%rip),%xmm0 + movaps %xmm6,p_ux2(%rsp) + maxps .L__real_m8192(%rip),%xmm6 + + +# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */ +# Step 1. Reduce the argument. 
+ # r = x * thirtytwo_by_logbaseof2; + movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 # + + mulps %xmm0,%xmm2 + xor %rax,%rax + minps .L__real_8192(%rip),%xmm2 + movaps .L__real_thirtytwo_by_log2(%rip),%xmm5 # + + mulps %xmm6,%xmm5 + minps .L__real_8192(%rip),%xmm5 # protect against large input values + + +# /* Set n = nearest integer to r */ + cvtps2dq %xmm2,%xmm3 + lea .L__two_to_jby32_table(%rip),%rdi + cvtdq2ps %xmm3,%xmm1 + + cvtps2dq %xmm5,%xmm8 + cvtdq2ps %xmm8,%xmm7 +# r1 = x - n * logbaseof2_by_32_lead; + movaps .L__real_log2_by_32_head(%rip),%xmm2 + mulps %xmm1,%xmm2 + subps %xmm2,%xmm0 # r1 in xmm0, + + movaps .L__real_log2_by_32_head(%rip),%xmm5 + mulps %xmm7,%xmm5 + subps %xmm5,%xmm6 # r1 in xmm6, + + +# r2 = - n * logbaseof2_by_32_lead; + mulps .L__real_log2_by_32_tail(%rip),%xmm1 + mulps .L__real_log2_by_32_tail(%rip),%xmm7 + +# j = n & 0x0000001f; + movdqa %xmm3,%xmm4 + movdqa .L__int_mask_1f(%rip),%xmm2 + movdqa %xmm8,%xmm9 + movdqa .L__int_mask_1f(%rip),%xmm5 + pand %xmm4,%xmm2 + movdqa %xmm2,p_j(%rsp) +# f1 = two_to_jby32_lead_table[j); + + pand %xmm9,%xmm5 + movdqa %xmm5,p_j2(%rsp) + +# *m = (n - j) / 32; + psubd %xmm2,%xmm4 + psrad $5,%xmm4 + movdqa %xmm4,p_m(%rsp) + psubd %xmm5,%xmm9 + psrad $5,%xmm9 + movdqa %xmm9,p_m2(%rsp) + + movaps %xmm0,%xmm3 + addps %xmm1,%xmm3 # r = r1+ r2 + + mov p_j(%rsp),%eax # get an individual index + movaps %xmm6,%xmm8 + mov (%rdi,%rax,4),%edx # get the f1 value + addps %xmm7,%xmm8 # r = r1+ r2 + mov %edx,p_j(%rsp) # save the f1 value + +# Step 2. Compute the polynomial. +# q = r1 + +# r*r*( 5.00000000000000008883e-01 + +# r*( 1.66666666665260878863e-01 + +# r*( 4.16666666662260795726e-02 + +# r*( 8.33336798434219616221e-03 + +# r*( 1.38889490863777199667e-03 ))))); +# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720 +# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision + movaps %xmm3,%xmm4 + movaps %xmm3,%xmm2 + mulps %xmm2,%xmm2 # x*x + mulps .L__real_1_24(%rip),%xmm4 # /24 + + mov p_j+4(%rsp),%eax # get an individual index + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j+4(%rsp) # save the f1 value + + addps .L__real_1_6(%rip),%xmm4 # +1/6 + + mulps %xmm2,%xmm3 # x^3 + mov p_j+8(%rsp),%eax # get an individual index + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j+8(%rsp) # save the f1 value + mulps .L__real_half(%rip),%xmm2 # x^2/2 + mov p_j+12(%rsp),%eax # get an individual index + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j+12(%rsp) # save the f1 value + mulps %xmm3,%xmm4 # *x^3 + mov p_j2(%rsp),%eax # get an individual index + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j2(%rsp) # save the f1 value + + addps %xmm4,%xmm1 # +r2 + + addps %xmm2,%xmm1 # + x^2/2 + addps %xmm1,%xmm0 # +r1 + + movaps %xmm8,%xmm9 + mov p_j2+4(%rsp),%eax # get an individual index + movaps %xmm8,%xmm5 + mulps %xmm5,%xmm5 # x*x + mulps .L__real_1_24(%rip),%xmm9 # /24 + + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j2+4(%rsp) # save the f1 value + +# deal with infinite or denormal results + movdqa p_m(%rsp),%xmm1 + movdqa p_m(%rsp),%xmm2 + pcmpgtd .L__int_127(%rip),%xmm2 + pminsw .L__int_128(%rip),%xmm1 # ceil at 128 + movmskps %xmm2,%eax + test $0x0f,%eax + + paddd .L__int_127(%rip),%xmm1 # add bias + +# *z2 = f2 + ((f1 + f2) * q); + mulps p_j(%rsp),%xmm0 # * f1 + addps p_j(%rsp),%xmm0 # + f1 + jnz .L__exp_largef +.L__check1: + + + pxor %xmm2,%xmm2 # floor at 0 + pmaxsw %xmm2,%xmm1 + + pslld $23,%xmm1 # build 2^n + + movaps %xmm1,%xmm2 + + + +# check for infinity or nan + movaps 
p_ux(%rsp),%xmm1 + andps .L__real_infinity(%rip),%xmm1 + cmpps $0,.L__real_infinity(%rip),%xmm1 + movmskps %xmm1,%ebx + test $0x0f,%ebx + + +# end of splitexp +# /* Scale (z1 + z2) by 2.0**m */ +# Step 3. Reconstitute. + + mulps %xmm2,%xmm0 # result *= 2^n + +# we'd like to avoid a branch, and can use cmp's and and's to +# eliminate them. But it adds cycles for normal cases +# to handle events that are supposed to be exceptions. +# Using this branch with the +# check above results in faster code for the normal cases. +# And branch mispredict penalties should only come into +# play for nans and infinities. + jnz .L__exp_naninf +.L__vsa_bottom1: + + # q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision + addps .L__real_1_6(%rip),%xmm9 # +1/6 + + mulps %xmm5,%xmm8 # x^3 + mov p_j2+8(%rsp),%eax # get an individual index + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j2+8(%rsp) # save the f1 value + mulps .L__real_half(%rip),%xmm5 # x^2/2 + mulps %xmm8,%xmm9 # *x^3 + + mov p_j2+12(%rsp),%eax # get an individual index + mov (%rdi,%rax,4),%edx # get the f1 value + mov %edx,p_j2+12(%rsp) # save the f1 value + addps %xmm9,%xmm7 # +r2 + + addps %xmm5,%xmm7 # + x^2/2 + addps %xmm7,%xmm6 # +r1 + + + # deal with infinite or denormal results + movdqa p_m2(%rsp),%xmm7 + movdqa p_m2(%rsp),%xmm5 + pcmpgtd .L__int_127(%rip),%xmm5 + pminsw .L__int_128(%rip),%xmm7 # ceil at 128 + movmskps %xmm5,%eax + test $0x0f,%eax + + paddd .L__int_127(%rip),%xmm7 # add bias + + # *z2 = f2 + ((f1 + f2) * q); + mulps p_j2(%rsp),%xmm6 # * f1 + addps p_j2(%rsp),%xmm6 # + f1 + jnz .L__exp_largef2 +.L__check2: + + pxor %xmm5,%xmm5 # floor at 0 + pmaxsw %xmm5,%xmm7 + + pslld $23,%xmm7 # build 2^n + + movaps %xmm7,%xmm5 + + + # check for infinity or nan + movaps p_ux2(%rsp),%xmm7 + andps .L__real_infinity(%rip),%xmm7 + cmpps $0,.L__real_infinity(%rip),%xmm7 + movmskps %xmm7,%ebx + test $0x0f,%ebx + + + # end of splitexp + # /* Scale (z1 + z2) by 2.0**m */ + # Step 3. Reconstitute. + + mulps %xmm5,%xmm6 # result *= 2^n +#__vsa_bottom1: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movups %xmm0,(%rdi) + + jnz .L__exp_naninf2 + +.L__vsa_bottom2: + + prefetch 64(%rdi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movups %xmm6,-16(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vsa_top + + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vsa_cleanup + + +# +.L__final_check: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + +# at least one of the numbers needs special treatment +.L__exp_naninf: + lea p_ux(%rsp),%rcx + call .L__fexp_naninf + jmp .L__vsa_bottom1 +.L__exp_naninf2: + lea p_ux2(%rsp),%rcx + movaps %xmm6,%xmm0 + call .L__fexp_naninf + movaps %xmm0,%xmm6 + jmp .L__vsa_bottom2 + +# deal with nans and infinities +# This subroutine checks a packed single for nans and infinities and +# produces the proper result from the exceptional inputs +# Register assumptions: +# Inputs: +# rbx - mask of errors +# xmm0 - computed result vector +# Outputs: +# xmm0 - new result vector +# %rax,rdx,rbx,%xmm2 all modified. + +.L__fexp_naninf: + movaps %xmm0,p_j+8(%rsp) # save the computed values + test $1,%ebx # first value? + jz .L__Lni2 + mov 0(%rcx),%edx # get the input + call .L__naninf + mov %edx,p_j+8(%rsp) # copy the result +.L__Lni2: + test $2,%ebx # second value? 
+ jz .L__Lni3
+ mov 4(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+12(%rsp) # copy the result
+.L__Lni3:
+ test $4,%ebx # third value?
+ jz .L__Lni4
+ mov 8(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+16(%rsp) # copy the result
+.L__Lni4:
+ test $8,%ebx # fourth value?
+ jz .L__Lnie
+ mov 12(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+20(%rsp) # copy the result
+.L__Lnie:
+ movaps p_j+8(%rsp),%xmm0 # get the answers
+ ret
+
+#
+# a simple subroutine to check a scalar input value for infinity
+# or NaN and return the correct result
+# expects input in %edx and returns the value in %edx. Destroys %eax.
+.L__naninf:
+ mov $0x0007FFFFF,%eax
+ test %eax,%edx
+ jnz .L__enan # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %edx,%eax
+ rcl $1,%eax
+ jnc .L__r # exp(+inf) = inf
+ xor %edx,%edx # exp(-inf) = 0
+ jmp .L__r
+
+#NaN
+.L__enan:
+ mov $0x000400000,%eax # convert to quiet
+ or %eax,%edx
+.L__r:
+ ret
+
+
+ .align 16
+# we jump here when we have an odd number of exp calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p_j(%rsp)
+ movaps %xmm0,p_j+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_j(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_j+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_j+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p_j+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p_j+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p_j+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p_j+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p_j(%rsp),%rsi # &x parameter
+ lea p_j2(%rsp),%rdx # &y parameter
+ call vrsa_expf@PLT # call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p_j2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p_j2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p_j2+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p_j2+12(%rsp),%ecx
+ mov %ecx,12(%rdi) # do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p_j2+16(%rsp),%ecx
+ mov %ecx,16(%rdi) # do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p_j2+20(%rsp),%ecx
+ mov %ecx,20(%rdi) # do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p_j2+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+ .align 16
+# deal with m > 127. In some instances, rounding during calculations
+# can result in infinity when it shouldn't. For these cases, we scale
+# m down, and scale the mantissa up.
+
+.L__exp_largef:
+ movdqa %xmm0,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm1,p_m(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf2 + mov p_m(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m(%rsp) # save the exponent + movss p_j(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j(%rsp) # save the mantissa +.L__Lf2: + test $2,%ecx # second value? + jz .L__Lf3 + mov p_m+4(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+4(%rsp) # save the exponent + movss p_j+4(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+4(%rsp) # save the mantissa +.L__Lf3: + test $4,%ecx # third value? + jz .L__Lf4 + mov p_m+8(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+8(%rsp) # save the exponent + movss p_j+8(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+8(%rsp) # save the mantissa +.L__Lf4: + test $8,%ecx # fourth value? + jz .L__Lfe + mov p_m+12(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m+12(%rsp) # save the exponent + movss p_j+12(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+12(%rsp) # save the mantissa +.L__Lfe: + movaps p_j(%rsp),%xmm0 # restore the mantissa portion back + movdqa p_m(%rsp),%xmm1 # restore the exponent portion + jmp .L__check1 + .align 16 + +.L__exp_largef2: + movdqa %xmm6,p_j(%rsp) # save the mantissa portion + movdqa %xmm7,p_m2(%rsp) # save the exponent portion + mov %eax,%ecx # save the error mask + test $1,%ecx # first value? + jz .L__Lf22 + mov p_m2+0(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m2+0(%rsp) # save the exponent + movss p_j+0(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+0(%rsp) # save the mantissa +.L__Lf22: + test $2,%ecx # second value? + jz .L__Lf32 + mov p_m2+4(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m2+4(%rsp) # save the exponent + movss p_j+4(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+4(%rsp) # save the mantissa +.L__Lf32: + test $4,%ecx # third value? + jz .L__Lf42 + mov p_m2+8(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m2+8(%rsp) # save the exponent + movss p_j+8(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+8(%rsp) # save the mantissa +.L__Lf42: + test $8,%ecx # fourth value? 
+ jz .L__Lfe2 + mov p_m2+12(%rsp),%edx # get the exponent + sub $1,%edx # scale it down + mov %edx,p_m2+12(%rsp) # save the exponent + movss p_j+12(%rsp),%xmm3 # get the mantissa + mulss .L__real_two(%rip),%xmm3 # scale it up + movss %xmm3,p_j+12(%rsp) # save the mantissa +.L__Lfe2: + movaps p_j(%rsp),%xmm6 # restore the mantissa portion back + movdqa p_m2(%rsp),%xmm7 # restore the exponent portion + jmp .L__check2 + + + .data + .align 64 +.L__real_half: .long 0x03f000000 # 1/2 + .long 0x03f000000 + .long 0x03f000000 + .long 0x03f000000 +.L__real_two: .long 0x40000000 # 2 + .long 0x40000000 + .long 0x40000000 + .long 0x40000000 +.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers + .long 0x46000000 + .long 0x46000000 + .long 0x46000000 +.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers + .long 0xC6000000 + .long 0xC6000000 + .long 0xC6000000 +.L__real_thirtytwo_by_log2: .long 0x04238AA3B # thirtytwo_by_log2 + .long 0x04238AA3B + .long 0x04238AA3B + .long 0x04238AA3B +.L__real_log2_by_32: .long 0x03CB17218 # log2_by_32 + .long 0x03CB17218 + .long 0x03CB17218 + .long 0x03CB17218 +.L__real_log2_by_32_head: .long 0x03CB17000 # log2_by_32 + .long 0x03CB17000 + .long 0x03CB17000 + .long 0x03CB17000 +.L__real_log2_by_32_tail: .long 0x0B585FDF4 # log2_by_32 + .long 0x0B585FDF4 + .long 0x0B585FDF4 + .long 0x0B585FDF4 +.L__real_1_6: .long 0x03E2AAAAB # 0.16666666666 used in polynomial + .long 0x03E2AAAAB + .long 0x03E2AAAAB + .long 0x03E2AAAAB +.L__real_1_24: .long 0x03D2AAAAB # 0.041666668 used in polynomial + .long 0x03D2AAAAB + .long 0x03D2AAAAB + .long 0x03D2AAAAB +.L__real_1_120: .long 0x03C088889 # 0.0083333338 used in polynomial + .long 0x03C088889 + .long 0x03C088889 + .long 0x03C088889 +.L__real_infinity: .long 0x07f800000 # infinity + .long 0x07f800000 + .long 0x07f800000 + .long 0x07f800000 +.L__int_mask_1f: .long 0x00000001f + .long 0x00000001f + .long 0x00000001f + .long 0x00000001f +.L__int_128: .long 0x000000080 + .long 0x000000080 + .long 0x000000080 + .long 0x000000080 +.L__int_127: .long 0x00000007f + .long 0x00000007f + .long 0x00000007f + .long 0x00000007f + +.L__two_to_jby32_table: + .long 0x03F800000 # 1.0000000000000000 + .long 0x03F82CD87 # 1.0218971486541166 + .long 0x03F85AAC3 # 1.0442737824274138 + .long 0x03F88980F # 1.0671404006768237 + .long 0x03F8B95C2 # 1.0905077326652577 + .long 0x03F8EA43A # 1.1143867425958924 + .long 0x03F91C3D3 # 1.1387886347566916 + .long 0x03F94F4F0 # 1.1637248587775775 + .long 0x03F9837F0 # 1.1892071150027210 + .long 0x03F9B8D3A # 1.2152473599804690 + .long 0x03F9EF532 # 1.2418578120734840 + .long 0x03FA27043 # 1.2690509571917332 + .long 0x03FA5FED7 # 1.2968395546510096 + .long 0x03FA9A15B # 1.3252366431597413 + .long 0x03FAD583F # 1.3542555469368927 + .long 0x03FB123F6 # 1.3839098819638320 + .long 0x03FB504F3 # 1.4142135623730951 + .long 0x03FB8FBAF # 1.4451808069770467 + .long 0x03FBD08A4 # 1.4768261459394993 + .long 0x03FC12C4D # 1.5091644275934228 + .long 0x03FC5672A # 1.5422108254079407 + .long 0x03FC9B9BE # 1.5759808451078865 + .long 0x03FCE248C # 1.6104903319492543 + .long 0x03FD2A81E # 1.6457554781539649 + .long 0x03FD744FD # 1.6817928305074290 + .long 0x03FDBFBB8 # 1.7186192981224779 + .long 0x03FE0CCDF # 1.7562521603732995 + .long 0x03FE5B907 # 1.7947090750031072 + .long 0x03FEAC0C7 # 1.8340080864093424 + .long 0x03FEFE4BA # 1.8741676341103000 + .long 0x03FF5257D # 1.9152065613971474 + .long 0x03FFA83B3 # 1.9571441241754002 + .long 0 # for alignment + + +
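+
+# Note on Step 3 above: rather than calling a scalbf-style helper, the code
+# builds the 2^m scale factor directly in the IEEE-754 exponent field
+# (paddd .L__int_127 then pslld $23). A hedged C sketch of the same trick,
+# assuming 32-bit IEEE-754 floats and -127 < m < 128 (the pminsw/pmaxsw
+# clamps and .L__exp_largef handle everything outside that range):
+#
+# union { float f; unsigned u; } two_m;
+# two_m.u = (unsigned)(m + 127) << 23; /* two_m.f == 2.0f**m */
+# y = two_m.f * (f1 + f1 * q); /* scale z1 + z2 by 2**m */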
diff --git a/src/gas/vrsalog10f.S b/src/gas/vrsalog10f.S new file mode 100644 index 0000000..003eaf1 --- /dev/null +++ b/src/gas/vrsalog10f.S
@@ -0,0 +1,1149 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalog10f.s
+#
+# An array implementation of the log10f libm function.
+#
+# Prototype:
+#
+# void vrsa_log10f(int n, float *x, float *y);
+#
+# Computes the base-10 logarithm of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+ .weak vrsa_log10f_
+ .set vrsa_log10f_,__vrsa_log10f__
+ .weak vrsa_log10f__
+ .set vrsa_log10f__,__vrsa_log10f__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log10f
+#** VRSA_LOG10F(N,X,Y)
+# C equivalent*/
+#void vrsa_log10f__(int * n, float *x, float *y)
+#{
+# vrsa_log10f(*n,x,y);
+#}
+.globl __vrsa_log10f__
+ .type __vrsa_log10f__,@function
+__vrsa_log10f__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+.globl vrsa_log10f
+ .type vrsa_log10f,@function
+vrsa_log10f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
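+#
+# One lane of the loop below, as a hedged scalar C sketch (table names
+# refer to the .L__np_ln_* data at the end of this file; index_of() is a
+# placeholder for the psrld/paddd mantissa-bit extraction, and poly() for
+# the cb1..cb3 polynomial approximating ln(1+u)):
+#
+# /* with x = 2**xexp * f as in the comments below, and f1 = j/128: */
+# int j = index_of(x); /* 64 <= j <= 128, from top mantissa bits */
+# float u = (f - f1) / (f1 + 0.5f*(f - f1)); /* reduce and get u */
+# float z1 = np_ln_lead_table[j - 64]; /* lead part of ln(j/64) */
+# float z2 = poly(u) + np_ln_tail_table[j - 64]; /* ln(1+u) plus tail */
+# float r1 = z1 + xexp * log2_lead; /* ln(x) kept as lead ... */
+# float r2 = z2 + xexp * log2_tail; /* ... and tail pieces */
+# y = r1*log10e_lead + (r1*log10e_tail + r2*log10e_tail + r2*log10e_lead);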
+ +.L__vsa_top: +# build the input _m128 + mov save_xa(%rsp),%rsi # get x_array pointer + movups (%rsi),%xmm0 + movups 16(%rsi),%xmm12 +# movhps .LQWORD,%xmm0 PTR [rsi+8] + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + +# check e as a special case + movdqa %xmm0,p_x(%rsp) # save x + movdqa %xmm12,p_x2(%rsp) # save x + movdqa %xmm0,%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movmskps %xmm2,%r9d + + movdqa %xmm12,%xmm9 + movaps %xmm12,%xmm7 + +# +# compute the index into the log tables +# + movdqa %xmm0,%xmm3 + movaps %xmm0,%xmm1 + psrld $23,%xmm3 + + # + # compute the index into the log tables + # + psrld $23,%xmm9 + subps .L__real_one(%rip),%xmm7 + psubd .L__mask_127(%rip),%xmm9 + subps .L__real_one(%rip),%xmm1 + psubd .L__mask_127(%rip),%xmm3 + cvtdq2ps %xmm9,%xmm13 # xexp + + movdqa %xmm12,%xmm9 + pand .L__real_mant(%rip),%xmm9 + xor %r8,%r8 + movdqa %xmm9,%xmm8 + movaps .L__real_half(%rip),%xmm11 # .5 + cvtdq2ps %xmm3,%xmm6 # xexp + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + xor %r8,%r8 + movdqa %xmm3,%xmm2 + movaps .L__real_half(%rip),%xmm5 # .5 + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $16,%xmm9 + movdqa %xmm9,%xmm10 + psrld $1,%xmm9 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + #/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + paddd .L__mask_040(%rip),%xmm9 + pand .L__mask_001(%rip),%xmm10 + paddd %xmm10,%xmm9 + cvtdq2ps %xmm9,%xmm7 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + packssdw %xmm9,%xmm9 + movq %xmm9,p_idx2(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + # reduce and get u + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + movdqa %xmm12,%xmm9 + orps .L__real_half(%rip),%xmm8 + + + mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128 + subps %xmm7,%xmm8 # f2 = f - f1 + mulps %xmm8,%xmm11 + addps %xmm11,%xmm7 + + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + + divps %xmm7,%xmm8 # u + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12(%rsp) # save the f1 values + + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12+8(%rsp) # save the f1 value + +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + 
addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + + lea .L__np_ln_tail_table(%rip),%rdx + addps %xmm4,%xmm1 # poly + +# recombine + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f2 value + + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f2 value + + addps p_q(%rsp),%xmm1 #z2 +=q + + movaps p_z1(%rsp),%xmm0 # z1 values + + mulps %xmm6,%xmm2 + addps %xmm2,%xmm0 #r1 + movaps %xmm0,%xmm2 + mulps .L__real_log2_tail(%rip),%xmm6 + addps %xmm6,%xmm1 #r2 + movaps %xmm1,%xmm3 + +# logef to log10f + mulps .L__real_log10e_tail(%rip),%xmm1 + mulps .L__real_log10e_tail(%rip),%xmm0 + mulps .L__real_log10e_lead(%rip),%xmm3 + mulps .L__real_log10e_lead(%rip),%xmm2 + addps %xmm1,%xmm0 + addps %xmm3,%xmm0 + addps %xmm2,%xmm0 +# addps %xmm1,%xmm0 + + + +# check for e +# test $0x0f,%r9d + # jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. + movmskps %xmm1,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg + +.L__f2: +## if +inf + movaps p_x(%rsp),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__log_inf +.L__f3: + + movaps p_x(%rsp),%xmm3 + subps .L__real_one(%rip),%xmm3 + andps .L__real_notsign(%rip),%xmm3 + cmpps $2,.L__real_threshold(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__near_one +.L__f4: + +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movups %xmm0,(%rdi) + +# finish the second set of calculations + + # solve for ln(1+u) + movaps %xmm8,%xmm7 # u + mulps %xmm8,%xmm8 # u^2 + movaps %xmm8,%xmm11 + + movaps .L__real_cb3(%rip),%xmm9 + mulps %xmm8,%xmm9 #Cu2 + mulps %xmm7,%xmm11 # u^3 + addps .L__real_cb2(%rip),%xmm9 #B+Cu2 + movaps %xmm8,%xmm10 + mulps %xmm11,%xmm10 # u^5 + movaps .L__real_log2_lead(%rip),%xmm8 + + mulps .L__real_cb1(%rip),%xmm11 #Au3 + addps %xmm11,%xmm7 # u+Au3 + mulps %xmm9,%xmm10 # u5(B+Cu2) + addps %xmm10,%xmm7 # poly + + + # recombine + lea .L__np_ln_tail_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2(%rsp) # save the f2 value + + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2+8(%rsp) # save the f2 value + + addps p_q2(%rsp),%xmm7 #z2 +=q + movaps p_z12(%rsp),%xmm1 # z1 values + + mulps %xmm13,%xmm8 + addps %xmm8,%xmm1 #r1 + movaps %xmm1,%xmm8 + mulps .L__real_log2_tail(%rip),%xmm13 + addps %xmm13,%xmm7 #r2 + movaps %xmm7,%xmm9 + # logef to log10f + mulps .L__real_log10e_tail(%rip),%xmm7 + mulps .L__real_log10e_tail(%rip),%xmm1 + mulps .L__real_log10e_lead(%rip),%xmm9 + mulps .L__real_log10e_lead(%rip),%xmm8 + addps %xmm7,%xmm1 + addps %xmm9,%xmm1 + addps %xmm8,%xmm1 +# addps %xmm7,%xmm1 + + # check e as a special case +# movaps p_x2(%rsp),%xmm10 +# cmpps $0,.L__real_ef(%rip),%xmm10 +# movmskps %xmm10,%r9d + # check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e2 +.L__f12: + + # check for 
negative numbers or zero
+ xorps %xmm7,%xmm7
+ cmpps $1,p_x2(%rsp),%xmm7 # 0 greater than =?. catches NaNs also.
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+ prefetch 64(%rsi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm1,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+
+
+ .align 16
+# we jump here when we have an odd number of log calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p2_temp+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p2_temp+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p2_temp+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+ call vrsa_log10f@PLT # call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+12(%rsp),%ecx
+ mov %ecx,12(%rdi) # do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+16(%rsp),%ecx
+ mov %ecx,16(%rdi) # do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+20(%rsp),%ecx
+ mov %ecx,20(%rdi) # do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3 + andnps %xmm1,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + jmp .L__f12 + + .align 16 +.L__near_one: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm3,p_omask(%rsp) # save ones mask + movaps p_x(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r +# u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm1 + divps %xmm2,%xmm1 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm1,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction +# u = u + u; + addps %xmm1,%xmm1 #u + movaps %xmm1,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm1,%xmm5 # Cu + movaps %xmm1,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm1 + mulps %xmm1,%xmm1 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm1 #u6(Cu+Du3) + addps %xmm1,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction +# loge to log10 + movaps %xmm3,%xmm5 #r1=r + pand .L__mask_lower(%rip),%xmm5 + subps %xmm5,%xmm3 + addps %xmm3,%xmm2 #r2 = r2 + (r-r1) + + movaps %xmm5,%xmm3 + movaps %xmm2,%xmm1 + + mulps .L__real_log10e_tail(%rip),%xmm2 + mulps .L__real_log10e_tail(%rip),%xmm3 + mulps .L__real_log10e_lead(%rip),%xmm1 + mulps .L__real_log10e_lead(%rip),%xmm5 + addps %xmm2,%xmm3 + addps %xmm1,%xmm3 + addps %xmm5,%xmm3 +# return r + r2; +# addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm0,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + + jmp .L__f4 + + + .align 16 +.L__near_one2: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm9,p_omask(%rsp) # save ones mask + movaps p_x2(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r + # u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm7 + divps %xmm2,%xmm7 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C + # correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm7,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction + # u = u + u; + addps %xmm7,%xmm7 #u + movaps %xmm7,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 + # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm7,%xmm5 # Cu + movaps %xmm7,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm7 + mulps %xmm7,%xmm7 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm7 #u6(Cu+Du3) + addps %xmm7,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + + #loge to log10 + movaps %xmm3,%xmm5 #r1=r + pand .L__mask_lower(%rip),%xmm5 + subps %xmm5,%xmm3 + addps %xmm3,%xmm2 #r2 = r2 + (r-r1) + + movaps %xmm5,%xmm3 + movaps %xmm2,%xmm7 + + mulps .L__real_log10e_tail(%rip),%xmm2 + mulps .L__real_log10e_tail(%rip),%xmm3 + mulps .L__real_log10e_lead(%rip),%xmm7 + mulps .L__real_log10e_lead(%rip),%xmm5 + addps %xmm2,%xmm3 + addps %xmm7,%xmm3 + addps %xmm5,%xmm3 + # return r + r2; +# addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm1,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # 
merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: .quad 0x04000000040000000 # 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x03EDE00003EDE0000
+.L__real_log10e_tail: .quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319 + .quad 0x03A37B1523A37B152 + + +.L__mask_lower: .quad 0x0ffff0000ffff0000 # + .quad 0x0ffff0000ffff0000 + +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 
6.53301239013671875000e-01 + .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + .long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + 
.long 0x387BE3CD # 0.000060055219 16 + .long 0x3803B715 # 0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + .long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment +
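+
+# Table note: each ln(1 + j/64) entry above is stored as a lead value
+# holding only the upper mantissa bits (.L__np_ln_lead_table) plus a tiny
+# residual (.L__np_ln_tail_table); summing the pair recovers extra
+# precision during reconstruction. A hedged C sketch of how such a pair
+# could be produced (an illustration, not the actual table generator):
+#
+# double exact = log(1.0 + j / 64.0);
+# union { float f; unsigned u; } t = { (float)exact };
+# t.u &= 0xFFFFF000u; /* zero the low mantissa bits -> lead */
+# float lead = t.f;
+# float tail = (float)(exact - (double)lead); /* e.g. 0x3805FDF4 at j=64 */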
diff --git a/src/gas/vrsalog2f.S b/src/gas/vrsalog2f.S new file mode 100644 index 0000000..9760d9f --- /dev/null +++ b/src/gas/vrsalog2f.S
@@ -0,0 +1,1140 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalog2f.s
+#
+# An array implementation of the log2f libm function.
+#
+# Prototype:
+#
+# void vrsa_log2f(int n, float *x, float *y);
+#
+# Computes the base-2 logarithm of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrsa_log2f_
+ .set vrsa_log2f_,__vrsa_log2f__
+ .weak vrsa_log2f__
+ .set vrsa_log2f__,__vrsa_log2f__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log2f
+#** VRSA_LOG2F(N,X,Y)
+# C equivalent*/
+#void vrsa_log2f__(int * n, float *x, float *y)
+#{
+# vrsa_log2f(*n,x,y);
+#}
+.globl __vrsa_log2f__
+ .type __vrsa_log2f__,@function
+__vrsa_log2f__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+.globl vrsa_log2f
+ .type vrsa_log2f,@function
+vrsa_log2f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
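+#
+# The reduction below matches vrsalog10f.S; only the final rescale
+# differs. A hedged C sketch of one lane's reconstruction, where z1/z2
+# are the lead/tail parts of ln(x) without the exponent term and xexp is
+# the unbiased exponent (names mirror the .L__real_log2e_* data):
+#
+# float r1 = z1 * log2e_lead + xexp; /* the exponent is exact in log2 */
+# float r2 = z1 * log2e_tail + z2 * log2e_tail + z2 * log2e_lead;
+# y = r1 + r2;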
+ +.L__vsa_top: +# build the input _m128 + mov save_xa(%rsp),%rsi # get x_array pointer + movups (%rsi),%xmm0 + movups 16(%rsi),%xmm12 +# movhps .LQWORD,%xmm0 PTR [rsi+8] + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + +# check e as a special case + movdqa %xmm0,p_x(%rsp) # save x + movdqa %xmm12,p_x2(%rsp) # save x + movdqa %xmm0,%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movmskps %xmm2,%r9d + + movdqa %xmm12,%xmm9 + movaps %xmm12,%xmm7 + +# +# compute the index into the log tables +# + movdqa %xmm0,%xmm3 + movaps %xmm0,%xmm1 + psrld $23,%xmm3 + + # + # compute the index into the log tables + # + psrld $23,%xmm9 + subps .L__real_one(%rip),%xmm7 + psubd .L__mask_127(%rip),%xmm9 + subps .L__real_one(%rip),%xmm1 + psubd .L__mask_127(%rip),%xmm3 + cvtdq2ps %xmm9,%xmm13 # xexp + + movdqa %xmm12,%xmm9 + pand .L__real_mant(%rip),%xmm9 + xor %r8,%r8 + movdqa %xmm9,%xmm8 + movaps .L__real_half(%rip),%xmm11 # .5 + cvtdq2ps %xmm3,%xmm6 # xexp + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + xor %r8,%r8 + movdqa %xmm3,%xmm2 + movaps .L__real_half(%rip),%xmm5 # .5 + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $16,%xmm9 + movdqa %xmm9,%xmm10 + psrld $1,%xmm9 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + #/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + paddd .L__mask_040(%rip),%xmm9 + pand .L__mask_001(%rip),%xmm10 + paddd %xmm10,%xmm9 + cvtdq2ps %xmm9,%xmm7 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + packssdw %xmm9,%xmm9 + movq %xmm9,p_idx2(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + # reduce and get u + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + movdqa %xmm12,%xmm9 + orps .L__real_half(%rip),%xmm8 + + + mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128 + subps %xmm7,%xmm8 # f2 = f - f1 + mulps %xmm8,%xmm11 + addps %xmm11,%xmm7 + + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + + divps %xmm7,%xmm8 # u + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12(%rsp) # save the f1 values + + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12+8(%rsp) # save the f1 value + +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2e_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + 
addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + movaps .L__real_log2e_tail(%rip),%xmm3 + lea .L__np_ln_tail_table(%rip),%rdx + addps %xmm4,%xmm1 # poly + +# recombine + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f2 value + + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f2 value + + addps p_q(%rsp),%xmm1 #z2 +=q + movaps %xmm1,%xmm4 #z2 copy + movaps p_z1(%rsp),%xmm0 # z1 values + movaps %xmm0,%xmm5 #z1 copy + + mulps %xmm2,%xmm5 #z1*log2e_lead + mulps %xmm2,%xmm1 #z2*log2e_lead + mulps %xmm3,%xmm4 #z2*log2e_tail + mulps %xmm3,%xmm0 #z1*log2e_tail + addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp + addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail + addps %xmm1,%xmm0 #r2 +#return r1+r2 + addps %xmm5,%xmm0 # r1+ r2 + + +# check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. + movmskps %xmm1,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg + +.L__f2: +### if +inf + movaps p_x(%rsp),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__log_inf +.L__f3: + + movaps p_x(%rsp),%xmm3 + subps .L__real_one(%rip),%xmm3 + andps .L__real_notsign(%rip),%xmm3 + cmpps $2,.L__real_threshold(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__near_one +.L__f4: + +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movups %xmm0,(%rdi) + +# finish the second set of calculations + + # solve for ln(1+u) + movaps %xmm8,%xmm7 # u + mulps %xmm8,%xmm8 # u^2 + movaps %xmm8,%xmm11 + + movaps .L__real_cb3(%rip),%xmm9 + mulps %xmm8,%xmm9 #Cu2 + mulps %xmm7,%xmm11 # u^3 + addps .L__real_cb2(%rip),%xmm9 #B+Cu2 + movaps %xmm8,%xmm10 + mulps %xmm11,%xmm10 # u^5 + movaps .L__real_log2e_lead(%rip),%xmm8 + + mulps .L__real_cb1(%rip),%xmm11 #Au3 + addps %xmm11,%xmm7 # u+Au3 + mulps %xmm9,%xmm10 # u5(B+Cu2) + movaps .L__real_log2e_tail(%rip),%xmm9 + addps %xmm10,%xmm7 # poly + + + # recombine + lea .L__np_ln_tail_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2(%rsp) # save the f2 value + + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2+8(%rsp) # save the f2 value + + addps p_q2(%rsp),%xmm7 #z2 +=q + movaps %xmm7,%xmm10 #z2 copy + movaps p_z12(%rsp),%xmm1 # z1 values + movaps %xmm1,%xmm11 #z1 copy + + mulps %xmm8,%xmm11 #z1*log2e_lead + mulps %xmm8,%xmm7 #z2*log2e_lead + mulps %xmm9,%xmm10 #z2*log2e_tail + mulps %xmm9,%xmm1 #z1*log2e_tail + addps %xmm13,%xmm11 #r1 = z1*log2e_lead + xexp + addps %xmm10,%xmm1 #z1*log2e_tail + z2*log2e_tail + addps %xmm7,%xmm1 #r2 + #return r1+r2 + addps %xmm11,%xmm1 # r1+ r2 + + # check e as a special case +# movaps p_x2(%rsp),%xmm10 +# cmpps $0,.L__real_ef(%rip),%xmm10 +# movmskps %xmm10,%r9d + # check for e +# test $0x0f,%r9d +# jnz .L__vlogf_e2 +.L__f12: + + # check for negative 
numbers or zero
+ xorps %xmm7,%xmm7
+ cmpps $1,p_x2(%rsp),%xmm7 # 0 greater than =?. catches NaNs also.
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ### if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+ prefetch 64(%rsi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm1,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+
+
+ .align 16
+# we jump here when we have an odd number of log calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p2_temp+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p2_temp+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p2_temp+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+ call vrsa_log2f@PLT # call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+12(%rsp),%ecx
+ mov %ecx,12(%rdi) # do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+16(%rsp),%ecx
+ mov %ecx,16(%rdi) # do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+20(%rsp),%ecx
+ mov %ecx,20(%rdi) # do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa
%xmm2,%xmm3 + andnps %xmm1,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + jmp .L__f12 + + .align 16 +.L__near_one: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm3,p_omask(%rsp) # save ones mask + movaps p_x(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r +# u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm1 + divps %xmm2,%xmm1 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C +# correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm1,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction +# u = u + u; + addps %xmm1,%xmm1 #u + movaps %xmm1,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 +# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm1,%xmm5 # Cu + movaps %xmm1,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm1 + mulps %xmm1,%xmm1 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm1 #u6(Cu+Du3) + addps %xmm1,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + + +# loge to log2 + movaps %xmm3,%xmm5 #r1=r + pand .L__mask_lower(%rip),%xmm5 + subps %xmm5,%xmm3 + addps %xmm3,%xmm2 #r2 = r2 + (r-r1) + + movaps %xmm5,%xmm3 + movaps %xmm2,%xmm1 + + mulps .L__real_log2e_tail(%rip),%xmm2 + mulps .L__real_log2e_tail(%rip),%xmm3 + mulps .L__real_log2e_lead(%rip),%xmm1 + mulps .L__real_log2e_lead(%rip),%xmm5 + addps %xmm2,%xmm3 + addps %xmm1,%xmm3 + addps %xmm5,%xmm3 + +# return r + r2; +# addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm0,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + + jmp .L__f4 + + + .align 16 +.L__near_one2: +# saves 10 cycles +# r = x - 1.0; + movdqa %xmm9,p_omask(%rsp) # save ones mask + movaps p_x2(%rsp),%xmm3 + movaps .L__real_two(%rip),%xmm2 + subps .L__real_one(%rip),%xmm3 # r + # u = r / (2.0 + r); + addps %xmm3,%xmm2 + movaps %xmm3,%xmm7 + divps %xmm2,%xmm7 # u + movaps .L__real_ca4(%rip),%xmm4 #D + movaps .L__real_ca3(%rip),%xmm5 #C + # correction = r * u; + movaps %xmm3,%xmm6 + mulps %xmm7,%xmm6 # correction + movdqa %xmm6,p_corr(%rsp) # save correction + # u = u + u; + addps %xmm7,%xmm7 #u + movaps %xmm7,%xmm2 + mulps %xmm2,%xmm2 #v =u^2 + # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + mulps %xmm7,%xmm5 # Cu + movaps %xmm7,%xmm6 + mulps %xmm2,%xmm6 # u^3 + mulps .L__real_ca2(%rip),%xmm2 #Bu^2 + mulps %xmm6,%xmm4 #Du^3 + + addps .L__real_ca1(%rip),%xmm2 # +A + movaps %xmm6,%xmm7 + mulps %xmm7,%xmm7 # u^6 + addps %xmm4,%xmm5 #Cu+Du3 + + mulps %xmm6,%xmm2 #u3(A+Bu2) + mulps %xmm5,%xmm7 #u6(Cu+Du3) + addps %xmm7,%xmm2 + subps p_corr(%rsp),%xmm2 # -correction + +# loge to log2 + movaps %xmm3,%xmm5 #r1=r + pand .L__mask_lower(%rip),%xmm5 + subps %xmm5,%xmm3 + addps %xmm3,%xmm2 #r2 = r2 + (r-r1) + + movaps %xmm5,%xmm3 + movaps %xmm2,%xmm7 + + mulps .L__real_log2e_tail(%rip),%xmm2 + mulps .L__real_log2e_tail(%rip),%xmm3 + mulps .L__real_log2e_lead(%rip),%xmm7 + mulps .L__real_log2e_lead(%rip),%xmm5 + addps %xmm2,%xmm3 + addps %xmm7,%xmm3 + addps %xmm5,%xmm3 + + # return r + r2; +# addps %xmm2,%xmm3 + + movdqa p_omask(%rsp),%xmm6 + movdqa %xmm6,%xmm2 + andnps %xmm1,%xmm6 # keep the non-nearone values + andps %xmm3,%xmm2 # setup the nearone values + orps %xmm6,%xmm2 # merge + 
movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
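+# Per scalar lane, the zero/negative handlers in this block amount to
+# roughly the following C (an illustrative sketch, not original source;
+# qnan and ninf stand for the .L__real_nan/.L__real_ninf constants):
+#   if (x < 0.0f || isnan(x)) r = qnan;   /* negative input -> NaN  */
+#   if (x == 0.0f)            r = ninf;   /* C99: log2(+/-0) = -inf */
+# Each case is folded into only the offending lanes with the usual
+# andps/andnps/orps mask merge.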
+ movmskps %xmm7,%r9d + test $0x0f,%r9d + jz .L__zn22 + + movdqa %xmm7,%xmm3 + andnps %xmm1,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + +.L__zn22: + # check for NaNs + movaps p_x2(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x2(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x2(%rsp),%xmm7 # isolate the NaNs + pand %xmm4,%xmm7 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm7,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm7 + andnps %xmm1,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f22 + + # handle only +inf log(+inf) = inf +.L__log_inf2: + movdqa %xmm9,%xmm7 + andnps %xmm1,%xmm9 # keep the non-error values + andps p_x2(%rsp),%xmm7 # setup the +inf values + orps %xmm9,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + jmp .L__f32 + + + .data + .align 64 + + +.L__real_zero: .quad 0x00000000000000000 # 1.0 + .quad 0x00000000000000000 +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 1.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantipsa bits + .quad 0x0007FFFFF007FFFFF +.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */ + .quad 0x03c0000003c000000 +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f +.L__mask_040: .quad 0x00000004000000040 # + .quad 0x00000004000000040 +.L__mask_001: .quad 0x00000000100000001 # + .quad 0x00000000100000001 + + +.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03 + .quad 0x03CF5C28F3CF5C28F + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03 + .quad 0x03B1249183B124918 +.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04 + .quad 0x039E401A639E401A6 +.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03 + .quad 0x03B124A123B124A12 +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 +.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000 + .quad 0x03FB800003FB80000 +.L__real_log2e_tail: .quad 
0x03BAA3B293BAA3B29 # 0.0051950408889633 + .quad 0x03BAA3B293BAA3B29 + +.L__mask_lower: .quad 0x0ffff0000ffff0000 # + .quad 0x0ffff0000ffff0000 + +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01 + .quad 0x3FE52A2D265BC5AB # 
6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + .long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + .long 0x387BE3CD # 0.000060055219 16 + .long 0x3803B715 # 
0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + .long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment +
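For reference, the `.L__near_one` paths above evaluate log2 of arguments inside the `.L__real_threshold` window around 1.0 from the series for ln(1+r), then rescale to base 2 with the split log2(e) constant. A rough single-lane C sketch follows; the coefficient values are taken from the data-section comments above, while the function and helper names (`near_one_log2f`, `trunc_hi16`) are illustrative, not part of the source:

    #include <string.h>

    /* minimax coefficients and split log2(e), from the data section */
    static const float ca_1 = 8.33333333333317923934e-02f;
    static const float ca_2 = 1.25000000037717509602e-02f;
    static const float ca_3 = 2.23213998791944806202e-03f;
    static const float ca_4 = 4.34887777707614552256e-04f;
    static const float log2e_lead = 1.4375f;             /* .L__real_log2e_lead */
    static const float log2e_tail = 0.0051950408889633f; /* .L__real_log2e_tail */

    static float trunc_hi16(float r)      /* the .L__mask_lower masking */
    {
        unsigned u;
        memcpy(&u, &r, 4);
        u &= 0xffff0000u;
        memcpy(&r, &u, 4);
        return r;
    }

    static float near_one_log2f(float x)
    {
        float r = x - 1.0f;
        float u = r / (2.0f + r);
        float correction = r * u;
        u = u + u;                         /* u = 2r/(2+r)             */
        float v = u * u;
        float r2 = u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4)))
                 - correction;             /* so ln(x) ~= r + r2       */
        float r1 = trunc_hi16(r);          /* short leading part of r  */
        r2 = r2 + (r - r1);                /* fold dropped bits into the tail */
        /* rescale ln -> log2 in lead/tail pieces, smallest terms first */
        return ((r2 * log2e_tail + r1 * log2e_tail) + r2 * log2e_lead)
               + r1 * log2e_lead;
    }

Splitting both r and log2(e) into lead/tail pieces keeps the large product exact in single precision, which is what buys the extra accuracy near 1.0 where log2(x) is close to zero.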
diff --git a/src/gas/vrsalogf.S b/src/gas/vrsalogf.S new file mode 100644 index 0000000..1f96523 --- /dev/null +++ b/src/gas/vrsalogf.S
@@ -0,0 +1,1088 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalogf.s
+#
+# An array implementation of the logf libm function.
+#
+# Prototype:
+#
+# void vrsa_logf(int n, float *x, float *y);
+#
+# Computes the natural log of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrsa_logf_
+ .set vrsa_logf_,__vrsa_logf__
+ .weak vrsa_logf__
+ .set vrsa_logf__,__vrsa_logf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array logf
+#** VRSA_LOGF(N,X,Y)
+# C equivalent*/
+#void vrsa_logf__(int * n, float *x, float *y)
+#{
+# vrsa_logf(*n,x,y);
+#}
+.globl __vrsa_logf__
+ .type __vrsa_logf__,@function
+__vrsa_logf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+.globl vrsa_logf
+ .type vrsa_logf,@function
+vrsa_logf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
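+# As a reading aid, the per-element computation in the loop below is
+# roughly the following C (a sketch assembled from the comments in this
+# file; the names mirror the stack slots and constants and are
+# illustrative rather than part of the source):
+#
+#   xexp = (int)(ux >> 23) - 127;    /* x = 2^xexp * f, 0.5 <= f < 1  */
+#   f1 = index * 0.0078125f;         /* top mantissa bits, index/128  */
+#   f2 = f - f1;
+#   u = f2 / (f1 + 0.5f * f2);
+#   poly = u + u*u*u * (cb_1 + u*u * (cb_2 + u*u * cb_3));
+#   /* z1/q are the lead/tail table entries for ln(1 + index/128) and
+#      log2_lead/log2_tail is a two-piece ln(2) */
+#   result = (xexp*log2_lead + z1) + (xexp*log2_tail + q + poly);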
+ +.L__vsa_top: +# build the input _m128 + mov save_xa(%rsp),%rsi # get x_array pointer + movups (%rsi),%xmm0 + movups 16(%rsi),%xmm12 +# movhps .LQWORD,%xmm0 PTR [rsi+8] + prefetch 64(%rsi) + add $32,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + +# check e as a special case + movdqa %xmm0,p_x(%rsp) # save x + movdqa %xmm12,p_x2(%rsp) # save x + movdqa %xmm0,%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movmskps %xmm2,%r9d + + movdqa %xmm12,%xmm9 + movaps %xmm12,%xmm7 + +# +# compute the index into the log tables +# + movdqa %xmm0,%xmm3 + movaps %xmm0,%xmm1 + psrld $23,%xmm3 + + # + # compute the index into the log tables + # + psrld $23,%xmm9 + subps .L__real_one(%rip),%xmm7 + psubd .L__mask_127(%rip),%xmm9 + subps .L__real_one(%rip),%xmm1 + psubd .L__mask_127(%rip),%xmm3 + cvtdq2ps %xmm9,%xmm13 # xexp + + movdqa %xmm12,%xmm9 + pand .L__real_mant(%rip),%xmm9 + xor %r8,%r8 + movdqa %xmm9,%xmm8 + movaps .L__real_half(%rip),%xmm11 # .5 + cvtdq2ps %xmm3,%xmm6 # xexp + + movdqa %xmm0,%xmm3 + pand .L__real_mant(%rip),%xmm3 + xor %r8,%r8 + movdqa %xmm3,%xmm2 + movaps .L__real_half(%rip),%xmm5 # .5 + +#/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + psrld $16,%xmm3 + lea .L__np_ln_lead_table(%rip),%rdx + movdqa %xmm3,%xmm4 + psrld $16,%xmm9 + movdqa %xmm9,%xmm10 + psrld $1,%xmm9 + psrld $1,%xmm3 + paddd .L__mask_040(%rip),%xmm3 + pand .L__mask_001(%rip),%xmm4 + paddd %xmm4,%xmm3 + cvtdq2ps %xmm3,%xmm1 + #/* Now x = 2**xexp * f, 1/2 <= f < 1. */ + paddd .L__mask_040(%rip),%xmm9 + pand .L__mask_001(%rip),%xmm10 + paddd %xmm10,%xmm9 + cvtdq2ps %xmm9,%xmm7 + packssdw %xmm3,%xmm3 + movq %xmm3,p_idx(%rsp) + packssdw %xmm9,%xmm9 + movq %xmm9,p_idx2(%rsp) + + +# reduce and get u + movdqa %xmm0,%xmm3 + orps .L__real_half(%rip),%xmm2 + + + mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128 + # reduce and get u + + + subps %xmm1,%xmm2 # f2 = f - f1 + mulps %xmm2,%xmm5 + addps %xmm5,%xmm1 + + divps %xmm1,%xmm2 # u + + movdqa %xmm12,%xmm9 + orps .L__real_half(%rip),%xmm8 + + + mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128 + subps %xmm7,%xmm8 # f2 = f - f1 + mulps %xmm8,%xmm11 + addps %xmm11,%xmm7 + + + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1(%rsp) # save the f1 values + + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z1+8(%rsp) # save the f1 value + + divps %xmm7,%xmm8 # u + lea .L__np_ln_lead_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12(%rsp) # save the f1 values + + + mov %cx,%r8w + ror $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f1 value + + mov %cx,%r8w + ror $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f1 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_z12+8(%rsp) # save the f1 value + +# solve for ln(1+u) + movaps %xmm2,%xmm1 # u + mulps %xmm2,%xmm2 # u^2 + movaps %xmm2,%xmm5 + movaps .L__real_cb3(%rip),%xmm3 + mulps %xmm2,%xmm3 #Cu2 + mulps %xmm1,%xmm5 # u^3 + addps .L__real_cb2(%rip),%xmm3 #B+Cu2 + movaps %xmm2,%xmm4 + mulps %xmm5,%xmm4 # u^5 + movaps .L__real_log2_lead(%rip),%xmm2 + + mulps .L__real_cb1(%rip),%xmm5 #Au3 + 
addps %xmm5,%xmm1 # u+Au3 + mulps %xmm3,%xmm4 # u5(B+Cu2) + + lea .L__np_ln_tail_table(%rip),%rdx + addps %xmm4,%xmm1 # poly + +# recombine + mov p_idx(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q(%rsp) # save the f2 value + + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q+8(%rsp) # save the f2 value + + addps p_q(%rsp),%xmm1 #z2 +=q + + movaps p_z1(%rsp),%xmm0 # z1 values + + mulps %xmm6,%xmm2 + addps %xmm2,%xmm0 #r1 + mulps .L__real_log2_tail(%rip),%xmm6 + addps %xmm6,%xmm1 #r2 + addps %xmm1,%xmm0 + + + +# check for e + test $0x0f,%r9d + jnz .L__vlogf_e +.L__f1: + +# check for negative numbers or zero + xorps %xmm1,%xmm1 + cmpps $1,p_x(%rsp),%xmm1 # 0 greater than =?. catches NaNs also. + movmskps %xmm1,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg + +.L__f2: +## if +inf + movaps p_x(%rsp),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__log_inf +.L__f3: + + movaps p_x(%rsp),%xmm3 + subps .L__real_one(%rip),%xmm3 + andps .L__real_notsign(%rip),%xmm3 + cmpps $2,.L__real_threshold(%rip),%xmm3 + movmskps %xmm3,%r9d + test $0x0f,%r9d + jnz .L__near_one +.L__f4: + +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movups %xmm0,(%rdi) + +# finish the second set of calculations + + # solve for ln(1+u) + movaps %xmm8,%xmm7 # u + mulps %xmm8,%xmm8 # u^2 + movaps %xmm8,%xmm11 + + movaps .L__real_cb3(%rip),%xmm9 + mulps %xmm8,%xmm9 #Cu2 + mulps %xmm7,%xmm11 # u^3 + addps .L__real_cb2(%rip),%xmm9 #B+Cu2 + movaps %xmm8,%xmm10 + mulps %xmm11,%xmm10 # u^5 + movaps .L__real_log2_lead(%rip),%xmm8 + + mulps .L__real_cb1(%rip),%xmm11 #Au3 + addps %xmm11,%xmm7 # u+Au3 + mulps %xmm9,%xmm10 # u5(B+Cu2) + addps %xmm10,%xmm7 # poly + + + # recombine + lea .L__np_ln_tail_table(%rip),%rdx + mov p_idx2(%rsp),%rcx # get the indexes + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + shr $16,%rcx + or -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2(%rsp) # save the f2 value + + + mov %cx,%r8w + shr $16,%rcx + mov -256(%rdx,%r8,4),%eax # get the f2 value + + mov %cx,%r8w + mov -256(%rdx,%r8,4),%ebx # get the f2 value + shl $32,%rbx + or %rbx,%rax + mov %rax,p_q2+8(%rsp) # save the f2 value + + addps p_q2(%rsp),%xmm7 #z2 +=q + movaps p_z12(%rsp),%xmm1 # z1 values + + mulps %xmm13,%xmm8 + addps %xmm8,%xmm1 #r1 + mulps .L__real_log2_tail(%rip),%xmm13 + addps %xmm13,%xmm7 #r2 + addps %xmm7,%xmm1 + + # check e as a special case + movaps p_x2(%rsp),%xmm10 + cmpps $0,.L__real_ef(%rip),%xmm10 + movmskps %xmm10,%r9d + # check for e + test $0x0f,%r9d + jnz .L__vlogf_e2 +.L__f12: + + # check for negative numbers or zero + xorps %xmm7,%xmm7 + cmpps $1,p_x2(%rsp),%xmm7 # 0 greater than =?. catches NaNs also. 
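+# Note: predicate 1 in the cmpps above is "less than", an ordered
+# compare, so 0.0 < x is false both for x <= 0 and for NaN lanes; the
+# movmskps/cmp pair below therefore takes the special-case path unless
+# all four lanes are strictly positive non-NaN values. Roughly:
+#   need_special = !(0.0f < x);   /* catches x <= 0, -inf, and NaNs */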
+ movmskps %xmm7,%r9d + cmp $0x0f,%r9d + jnz .L__z_or_neg2 + +.L__f22: + ## if +inf + movaps p_x2(%rsp),%xmm9 + cmpps $0,.L__real_inf(%rip),%xmm9 + movmskps %xmm9,%r9d + test $0x0f,%r9d + jnz .L__log_inf2 +.L__f32: + + movaps p_x2(%rsp),%xmm9 + subps .L__real_one(%rip),%xmm9 + andps .L__real_notsign(%rip),%xmm9 + cmpps $2,.L__real_threshold(%rip),%xmm9 + movmskps %xmm9,%r9d + test $0x0f,%r9d + jnz .L__near_one2 +.L__f42: + + + prefetch 64(%rsi) + add $32,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + +# store the result _m128d + movups %xmm1,-16(%rdi) + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vsa_top + + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vsa_cleanup + + +# +.L__final_check: + mov save_rbx(%rsp),%rbx # restore rbx + add $stack_size,%rsp + ret + + + + .align 16 +# we jump here when we have an odd number of log calls to make at the +# end +.L__vsa_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__final_check # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + +# fill in a m128 with zeroes and the extra values and then make a recursive call. + xorps %xmm0,%xmm0 + movaps %xmm0,p2_temp(%rsp) + movaps %xmm0,p2_temp+16(%rsp) + + mov (%rsi),%ecx # we know there's at least one + mov %ecx,p2_temp(%rsp) + cmp $2,%rax + jl .L__vsacg + + mov 4(%rsi),%ecx # do the second value + mov %ecx,p2_temp+4(%rsp) + cmp $3,%rax + jl .L__vsacg + + mov 8(%rsi),%ecx # do the third value + mov %ecx,p2_temp+8(%rsp) + cmp $4,%rax + jl .L__vsacg + + mov 12(%rsi),%ecx # do the fourth value + mov %ecx,p2_temp+12(%rsp) + cmp $5,%rax + jl .L__vsacg + + mov 16(%rsi),%ecx # do the fifth value + mov %ecx,p2_temp+16(%rsp) + cmp $6,%rax + jl .L__vsacg + + mov 20(%rsi),%ecx # do the sixth value + mov %ecx,p2_temp+20(%rsp) + cmp $7,%rax + jl .L__vsacg + + mov 24(%rsi),%ecx # do the last value + mov %ecx,p2_temp+24(%rsp) + +.L__vsacg: + mov $8,%rdi # parameter for N + lea p2_temp(%rsp),%rsi # &x parameter + lea p2_temp1(%rsp),%rdx # &y parameter + call vrsa_logf@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + mov p2_temp1(%rsp),%ecx + mov %ecx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vsacgf + + mov p2_temp1+4(%rsp),%ecx + mov %ecx,4(%rdi) # do the second value + cmp $3,%rax + jl .L__vsacgf + + mov p2_temp1+8(%rsp),%ecx + mov %ecx,8(%rdi) # do the second value + cmp $4,%rax + jl .L__vsacgf + + mov p2_temp1+12(%rsp),%ecx + mov %ecx,12(%rdi) # do the second value + cmp $5,%rax + jl .L__vsacgf + + mov p2_temp1+16(%rsp),%ecx + mov %ecx,16(%rdi) # do the second value + cmp $6,%rax + jl .L__vsacgf + + mov p2_temp1+20(%rsp),%ecx + mov %ecx,20(%rdi) # do the second value + cmp $7,%rax + jl .L__vsacgf + + mov p2_temp1+24(%rsp),%ecx + mov %ecx,24(%rdi) # do the last value + +.L__vsacgf: + jmp .L__final_check + + +.L__vlogf_e: + movdqa p_x(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm0,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 values + orps %xmm3,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + jmp .L__f1 + +.L__vlogf_e2: + movdqa p_x2(%rsp),%xmm2 + cmpps $0,.L__real_ef(%rip),%xmm2 + movdqa %xmm2,%xmm3 + andnps %xmm1,%xmm3 # keep the non-e values + andps .L__real_one(%rip),%xmm2 # setup the 1 
values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ # return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d + test $0x0f,%r9d + jz .L__zn2 + + movdqa %xmm1,%xmm3 + andnps %xmm0,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + +.L__zn2: +# check for NaNs + movaps p_x(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x(%rsp),%xmm1 # isolate the NaNs + pand %xmm4,%xmm1 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm1,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm1 + andnps %xmm0,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm0 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f2 + +# handle only +inf log(+inf) = inf +.L__log_inf: + movdqa %xmm3,%xmm1 + andnps %xmm0,%xmm3 # keep the non-error values + andps p_x(%rsp),%xmm1 # setup the +inf values + orps %xmm3,%xmm1 # merge + movdqa %xmm1,%xmm0 # and replace + jmp .L__f3 + + +.L__z_or_neg2: + # deal with negatives first + movdqa %xmm7,%xmm3 + andps %xmm1,%xmm3 # keep the non-error values + andnps .L__real_nan(%rip),%xmm7 # setup the nan values + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + # check for +/- 0 + xorps %xmm7,%xmm7 + cmpps $0,p_x2(%rsp),%xmm7 # 0 ?. + movmskps %xmm7,%r9d + test $0x0f,%r9d + jz .L__zn22 + + movdqa %xmm7,%xmm3 + andnps %xmm1,%xmm3 # keep the non-error values + andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0 + orps %xmm3,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + +.L__zn22: + # check for NaNs + movaps p_x2(%rsp),%xmm3 + andps .L__real_inf(%rip),%xmm3 + cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent + + movdqa p_x2(%rsp),%xmm4 + pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa + pcmpeqd .L__real_zero(%rip),%xmm4 + pandn %xmm3,%xmm4 # mask for NaNs + movdqa %xmm4,%xmm2 + movdqa p_x2(%rsp),%xmm7 # isolate the NaNs + pand %xmm4,%xmm7 + + pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit + por %xmm7,%xmm4 # turn SNaNs to QNaNs + + movdqa %xmm2,%xmm7 + andnps %xmm1,%xmm2 # keep the non-error values + orps %xmm4,%xmm2 # merge + movdqa %xmm2,%xmm1 # and replace + xorps %xmm4,%xmm4 + + jmp .L__f22 + + # handle only +inf log(+inf) = inf +.L__log_inf2: + movdqa %xmm9,%xmm7 + andnps %xmm1,%xmm9 # keep the non-error values + andps p_x2(%rsp),%xmm7 # setup the +inf values + orps %xmm9,%xmm7 # merge + movdqa %xmm7,%xmm1 # and replace + jmp .L__f32 + + + .data + .align 64 + + +.L__real_zero: .quad 0x00000000000000000 # 1.0 + .quad 0x00000000000000000 +.L__real_one: .quad 0x03f8000003f800000 # 1.0 + .quad 0x03f8000003f800000 +.L__real_two: .quad 0x04000000040000000 # 1.0 + .quad 0x04000000040000000 +.L__real_ninf: .quad 0x0ff800000ff800000 # -inf + .quad 0x0ff800000ff800000 +.L__real_inf: .quad 0x07f8000007f800000 # +inf + .quad 0x07f8000007f800000 +.L__real_nan: .quad 0x07fc000007fc00000 # NaN + .quad 0x07fc000007fc00000 +.L__real_ef: .quad 0x0402DF854402DF854 # float e + .quad 0x0402DF854402DF854 + +.L__real_sign: .quad 0x08000000080000000 # sign bit + .quad 0x08000000080000000 +.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit + .quad 0x07ffFFFFF7ffFFFFF +.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit + .quad 0x00040000000400000 +.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantipsa bits + 
.quad 0x0007FFFFF007FFFFF +.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */ + .quad 0x03c0000003c000000 +.L__mask_127: .quad 0x00000007f0000007f # + .quad 0x00000007f0000007f +.L__mask_040: .quad 0x00000004000000040 # + .quad 0x00000004000000040 +.L__mask_001: .quad 0x00000000100000001 # + .quad 0x00000000100000001 + + +.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03 + .quad 0x03CF5C28F3CF5C28F + +.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03 + .quad 0x03B1249183B124918 +.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04 + .quad 0x039E401A639E401A6 +.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02 + .quad 0x03DAAAAAB3DAAAAAB +.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02 + .quad 0x03C4CCCCD3C4CCCCD +.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03 + .quad 0x03B124A123B124A12 +.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375 + .quad 0x03F3170003F317000 +.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183 + .quad 0x03805FDF43805FDF4 +.L__real_half: .quad 0x03f0000003f000000 # 1/2 + .quad 0x03f0000003f000000 + + +.L__np_ln__table: + .quad 0x0000000000000000 # 0.00000000000000000000e+00 + .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02 + .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02 + .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02 + .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02 + .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02 + .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02 + .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01 + .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01 + .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01 + .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01 + .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01 + .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01 + .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01 + .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01 + .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01 + .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01 + .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01 + .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01 + .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01 + .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01 + .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01 + .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01 + .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01 + .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01 + .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01 + .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01 + .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01 + .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01 + .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01 + .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01 + .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01 + .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01 + .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01 + .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01 + .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01 + .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01 + .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01 + .quad 0x3FDDD46A04C1C4A1 # 
4.66089725494384765625e-01 + .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01 + .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01 + .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01 + .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01 + .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01 + .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01 + .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01 + .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01 + .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01 + .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01 + .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01 + .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01 + .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01 + .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01 + .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01 + .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01 + .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01 + .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01 + .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01 + .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01 + .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01 + .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01 + .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01 + .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01 + .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01 + .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01 + .quad 0 # for alignment + +.L__np_ln_lead_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x3C7E0000 # 0.015502929688 1 + .long 0x3CFC1000 # 0.030769348145 2 + .long 0x3D3BA000 # 0.045806884766 3 + .long 0x3D785000 # 0.060623168945 4 + .long 0x3D9A0000 # 0.075195312500 5 + .long 0x3DB78000 # 0.089599609375 6 + .long 0x3DD49000 # 0.103790283203 7 + .long 0x3DF13000 # 0.117767333984 8 + .long 0x3E06B000 # 0.131530761719 9 + .long 0x3E14A000 # 0.145141601563 10 + .long 0x3E226000 # 0.158569335938 11 + .long 0x3E2FF000 # 0.171813964844 12 + .long 0x3E3D5000 # 0.184875488281 13 + .long 0x3E4A9000 # 0.197814941406 14 + .long 0x3E579000 # 0.210510253906 15 + .long 0x3E647000 # 0.223083496094 16 + .long 0x3E713000 # 0.235534667969 17 + .long 0x3E7DC000 # 0.247802734375 18 + .long 0x3E851000 # 0.259887695313 19 + .long 0x3E8B3000 # 0.271850585938 20 + .long 0x3E914000 # 0.283691406250 21 + .long 0x3E974000 # 0.295410156250 22 + .long 0x3E9D3000 # 0.307006835938 23 + .long 0x3EA30000 # 0.318359375000 24 + .long 0x3EA8D000 # 0.329711914063 25 + .long 0x3EAE8000 # 0.340820312500 26 + .long 0x3EB43000 # 0.351928710938 27 + .long 0x3EB9C000 # 0.362792968750 28 + .long 0x3EBF5000 # 0.373657226563 29 + .long 0x3EC4D000 # 0.384399414063 30 + .long 0x3ECA3000 # 0.394897460938 31 + .long 0x3ECF9000 # 0.405395507813 32 + .long 0x3ED4E000 # 0.415771484375 33 + .long 0x3EDA2000 # 0.426025390625 34 + .long 0x3EDF5000 # 0.436157226563 35 + .long 0x3EE47000 # 0.446166992188 36 + .long 0x3EE99000 # 0.456176757813 37 + .long 0x3EEEA000 # 0.466064453125 38 + .long 0x3EF3A000 # 0.475830078125 39 + .long 0x3EF89000 # 0.485473632813 40 + .long 0x3EFD7000 # 0.494995117188 41 + .long 0x3F012000 # 0.504394531250 42 + .long 0x3F039000 # 0.513916015625 43 + .long 0x3F05F000 # 0.523193359375 44 + .long 0x3F084000 # 0.532226562500 45 + .long 0x3F0AA000 # 0.541503906250 46 + .long 0x3F0CF000 # 0.550537109375 47 + .long 0x3F0F4000 # 0.559570312500 48 + .long 0x3F118000 # 0.568359375000 49 + .long 0x3F13C000 # 0.577148437500 50 + .long 0x3F160000 # 0.585937500000 51 + 
.long 0x3F183000 # 0.594482421875 52 + .long 0x3F1A7000 # 0.603271484375 53 + .long 0x3F1C9000 # 0.611572265625 54 + .long 0x3F1EC000 # 0.620117187500 55 + .long 0x3F20E000 # 0.628417968750 56 + .long 0x3F230000 # 0.636718750000 57 + .long 0x3F252000 # 0.645019531250 58 + .long 0x3F273000 # 0.653076171875 59 + .long 0x3F295000 # 0.661376953125 60 + .long 0x3F2B5000 # 0.669189453125 61 + .long 0x3F2D6000 # 0.677246093750 62 + .long 0x3F2F7000 # 0.685302734375 63 + .long 0x3F317000 # 0.693115234375 64 + .long 0 # for alignment + +.L__np_ln_tail_table: + .long 0x00000000 # 0.000000000000 0 + .long 0x35A8B0FC # 0.000001256848 1 + .long 0x361B0E78 # 0.000002310522 2 + .long 0x3631EC66 # 0.000002651266 3 + .long 0x35C30046 # 0.000001452871 4 + .long 0x37EBCB0E # 0.000028108738 5 + .long 0x37528AE5 # 0.000012549314 6 + .long 0x36DA7496 # 0.000006510479 7 + .long 0x3783B715 # 0.000015701671 8 + .long 0x383F3E68 # 0.000045596069 9 + .long 0x38297C10 # 0.000040408282 10 + .long 0x3815B666 # 0.000035694240 11 + .long 0x38183854 # 0.000036292084 12 + .long 0x38448108 # 0.000046850211 13 + .long 0x373539E9 # 0.000010801924 14 + .long 0x3864A740 # 0.000054515200 15 + .long 0x387BE3CD # 0.000060055219 16 + .long 0x3803B715 # 0.000031403342 17 + .long 0x380C36AF # 0.000033429529 18 + .long 0x3892713A # 0.000069829126 19 + .long 0x38AE55D6 # 0.000083129547 20 + .long 0x38A0FDE8 # 0.000076766883 21 + .long 0x3862BAE1 # 0.000054056643 22 + .long 0x3798AAD3 # 0.000018199358 23 + .long 0x38C5E10E # 0.000094356117 24 + .long 0x382D872E # 0.000041372310 25 + .long 0x38DEDFAC # 0.000106274470 26 + .long 0x38481E9B # 0.000047712219 27 + .long 0x38EBFB5E # 0.000112524940 28 + .long 0x38783B83 # 0.000059183232 29 + .long 0x374E1B05 # 0.000012284848 30 + .long 0x38CA0E11 # 0.000096347307 31 + .long 0x3891F660 # 0.000069600297 32 + .long 0x386C9A9A # 0.000056410769 33 + .long 0x38777BCD # 0.000059004688 34 + .long 0x38A6CED4 # 0.000079540216 35 + .long 0x38FBE3CD # 0.000120110439 36 + .long 0x387E7E01 # 0.000060675669 37 + .long 0x37D40984 # 0.000025276800 38 + .long 0x3784C3AD # 0.000015826745 39 + .long 0x380F5FAF # 0.000034182969 40 + .long 0x38AC47BC # 0.000082149607 41 + .long 0x392952D3 # 0.000161479504 42 + .long 0x37F97073 # 0.000029735476 43 + .long 0x3865C84A # 0.000054784388 44 + .long 0x3979CF17 # 0.000238236375 45 + .long 0x38C3D2F5 # 0.000093376184 46 + .long 0x38E6B468 # 0.000110008579 47 + .long 0x383EBCE1 # 0.000045475437 48 + .long 0x39186BDF # 0.000145360347 49 + .long 0x392F0945 # 0.000166927537 50 + .long 0x38E9ED45 # 0.000111545007 51 + .long 0x396B99A8 # 0.000224685878 52 + .long 0x37A27674 # 0.000019367064 53 + .long 0x397069AB # 0.000229275480 54 + .long 0x39013539 # 0.000123222257 55 + .long 0x3947F423 # 0.000190690669 56 + .long 0x3945E10E # 0.000188712234 57 + .long 0x38F85DB0 # 0.000118430122 58 + .long 0x396C08DC # 0.000225100142 59 + .long 0x37B4996F # 0.000021529120 60 + .long 0x397CEADA # 0.000241200818 61 + .long 0x3920261B # 0.000152729845 62 + .long 0x35AA4906 # 0.000001268724 63 + .long 0x3805FDF4 # 0.000031946183 64 + .long 0 # for alignment +
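The `.L__vsa_cleanup` paths in both files above handle an element count that is not a multiple of the vector width by copying the stragglers into a zero-filled scratch block, making one recursive full-width call, and copying back only the valid results; the padded lanes compute log(0) = -inf but are never stored. Schematically, in C (a sketch of the control flow, not the shipped code; `VW` and the function name are illustrative):

    #define VW 8   /* vrsa_logf processes 8 floats per iteration */

    extern void vrsa_logf(int n, float *x, float *y);

    static void vrsa_logf_tail(int leftover, float *x, float *y)
    {
        float xtmp[VW] = {0.0f};   /* zero padding, discarded below  */
        float ytmp[VW];
        for (int i = 0; i < leftover; i++) xtmp[i] = x[i];
        vrsa_logf(VW, xtmp, ytmp); /* one recursive full-width call  */
        for (int i = 0; i < leftover; i++) y[i] = ytmp[i];
    }

This keeps the scalar tail on the exact same code path as the bulk of the array, at the cost of a few wasted lanes on the final iteration.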
diff --git a/src/gas/vrsapowf.S b/src/gas/vrsapowf.S new file mode 100644 index 0000000..3521a6b --- /dev/null +++ b/src/gas/vrsapowf.S
@@ -0,0 +1,782 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrsapowf.asm +# +# An array implementation of the powf libm function. +# +# Prototype: +# +# void vrsa_powf(int n, float *x, float *y, float *z); +# +# Computes x raised to the y power. +# +# Places the results into the supplied z array. +# Does not perform error handling, but does return C99 values for error +# inputs. Denormal results are truncated to 0. + +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +# define local variable storage offsets +.equ p_temp,0x00 # xmmword +.equ p_negateres,0x10 # qword + + +.equ save_rbx,0x030 #qword + + +.equ p_ax,0x050 # absolute x +.equ p_sx,0x060 # sign of x's + +.equ p_ay,0x070 # absolute y +.equ p_yexp,0x080 # unbiased exponent of y + +.equ p_inty,0x090 # integer y indicators + +.equ p_xptr,0x0a0 # ptr to x values +.equ p_yptr,0x0a8 # ptr to y values +.equ p_zptr,0x0b0 # ptr to z values + +.equ p_nv,0x0b8 #qword +.equ p_iter,0x0c0 # qword storage for number of loop iterations + +.equ p2_temp,0x0d0 #qword +.equ p2_temp1,0x0f0 #qword + +.equ stack_size,0x0118 # allocate 40h more than + # we need to avoid bank conflicts + + + + + .weak vrsa_powf_ + .set vrsa_powf_,__vrsa_powf__ + .weak vrsa_powf__ + .set vrsa_powf__,__vrsa_powf__ + + .text + .align 16 + .p2align 4,,15 + +#/* a FORTRAN subroutine implementation of array powf +#** VRSA_POWF(N,X,Y,Z) +#** C equivalent +#*/ +#void vrsa_powf_(int * n, float *x, float *y, float *z) +#{ +# vrsa_powf(*n,x,y,z); +#} + +.globl __vrsa_powf__ + .type __vrsa_powf__,@function +__vrsa_powf__: + mov (%rdi),%edi + + +# parameters are passed in by Linux as: +# edi - int n +# rsi - float *x +# rdx - float *y +# rcx - float *z + +.globl vrsa_powf + .type vrsa_powf,@function +vrsa_powf: + + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx +# save the arguments + mov %rsi,p_xptr(%rsp) # save pointer to x + mov %rdx,p_yptr(%rsp) # save pointer to y + mov %rcx,p_zptr(%rsp) # save pointer to z +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax +#endif + + mov %rax,%rcx + mov %rcx,p_nv(%rsp) # save number of values +# see if too few values to call the main loop + shr $2,%rax # get number of iterations + jz .L__vsa_cleanup # jump if only single calls +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rcx # compute number of extra single calls + mov %rcx,p_nv(%rsp) # save number of left over values + +# process the array 4 values at a time. 
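+# Road map for the loop below, as informal C assembled from this file's
+# comments (illustrative only; qnan stands for the 0x7FC00000 pattern):
+#
+#   /* inty = 0: y not an integer, 1: odd integer, 2: even integer */
+#   ax = fabsf(x);
+#   r = (float)exp(y * log((double)ax)); /* via __vrd4_log/__vrd4_exp */
+#   if (x < 0 && inty == 0) r = qnan;    /* x < 0 needs integral y    */
+#   if (x < 0 && inty == 1) r = -r;      /* odd y keeps x's sign      */
+#   /* then patch the C99 special cases: |y|==0, x==+1, infs, NaNs */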
+
+.L__vsa_top:
+# build the input _m128
+# first get x
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ prefetch 64(%rsi)
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# classify y
+# vector 32 bit integer method 25 cycles to here
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# inty = 1 means odd integer.
+# inty = 2 means even integer.
+# */
+ mov p_yptr(%rsp),%rdi # get y_array pointer
+ movups (%rdi),%xmm4
+ prefetch 64(%rdi)
+ pxor %xmm3,%xmm3
+ pand .L__mask_nsign(%rip),%xmm4 # get abs y in integer format
+ movdqa %xmm4,p_ay(%rsp) # save it
+
+# see if the number is less than 1.0
+ psrld $23,%xmm4 #>> EXPSHIFTBITS_SP32
+
+ psubd .L__mask_127(%rip),%xmm4 # yexp, unbiased exponent
+ movdqa %xmm4,p_yexp(%rsp) # save it
+ paddd .L__mask_1(%rip),%xmm4 # yexp+1
+ pcmpgtd %xmm3,%xmm4 # 0 if exp less than 126 (2^0) (y < 1.0), else FFs
+# xmm4 is ffs if abs(y) >=1.0, else 0
+
+# see if the mantissa has fractional bits
+# build mask for mantissa
+ movdqa .L__mask_23(%rip),%xmm2
+ psubd p_yexp(%rsp),%xmm2 # 24-yexp
+ pmaxsw %xmm3,%xmm2 # no shift counts less than 0
+ movdqa %xmm2,p_temp(%rsp) # save the shift counts
+# create mask for all four values
+# SSE can't do individual shifts, so each one has to be done separately
+ mov p_temp(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rax,%rbx
+ mov %rbx,p_temp(%rsp)
+ mov p_temp+8(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rbx,%rax
+ mov %rax,p_temp+8(%rsp)
+ movdqa p_temp(%rsp),%xmm5
+ psubd .L__mask_1(%rip),%xmm5 #= mask = (1 << (24 - yexp)) - 1
+
+# now use the mask to see if there are any fractional bits
+ movdqu (%rdi),%xmm2 # get uy
+ pand %xmm5,%xmm2 # uy & mask
+ pcmpeqd %xmm3,%xmm2 # 0 if not zero (y has fractional mantissa bits), else FFs
+ pand %xmm4,%xmm2 # either 0s or ff
+# xmm2 now accounts for y< 1.0 or y>=1.0 and y has fractional mantissa bits,
+# it has the value 0 if we know it's non-integer or ff if integer.
+
+# now see if it's even or odd.
+
+## if yexp > 24, then it has to be even
+ movdqa .L__mask_24(%rip),%xmm4
+ psubd p_yexp(%rsp),%xmm4 # 24-yexp
+ paddd .L__mask_1(%rip),%xmm5 # mask+1 = least significant integer bit
+ pcmpgtd %xmm3,%xmm4 # if 0, then must be even, else ff's
+
+ pand %xmm4,%xmm5 # set the integer bit mask to zero if yexp>24
+ paddd .L__mask_2(%rip),%xmm4
+ por .L__mask_2(%rip),%xmm4
+ pand %xmm2,%xmm4 # result can be 0, 2, or 3
+
+# now for integer numbers, see if odd or even
+ pand .L__mask_mant(%rip),%xmm5 # mask out exponent bits
+ movdqu (%rdi),%xmm2
+ pand %xmm2,%xmm5 # & uy -> even or odd
+ movdqa .L__float_one(%rip),%xmm2
+ pcmpeqd p_ay(%rsp),%xmm2 # is ay equal to 1, ff's if so, then it's odd
+ pand .L__mask_nsign(%rip),%xmm2 # strip the sign bit so the gt comparison works.
+ por %xmm2,%xmm5
+ pcmpgtd %xmm3,%xmm5 # if odd then ff's, else 0's for even
+ paddd .L__mask_2(%rip),%xmm5 # gives us 2 for even, 1 for odd
+ pand %xmm5,%xmm4
+
+ movdqa %xmm4,p_inty(%rsp) # save inty
+#
+# do more x special case checking
+#
+ movdqa %xmm4,%xmm5
+ pcmpeqd %xmm3,%xmm5 # is not an integer?
ff's if so + pand .L__mask_NaN(%rip),%xmm5 # these values will be NaNs, if x<0 + movdqa %xmm4,%xmm2 + pcmpeqd .L__mask_1(%rip),%xmm2 # is it odd? ff's if so + pand .L__mask_sign(%rip),%xmm2 # these values will get their sign bit set + por %xmm2,%xmm5 + + pcmpeqd p_sx(%rsp),%xmm3 # if the signs are set + pandn %xmm5,%xmm3 # then negateres gets the values as shown below + movdqa %xmm3,p_negateres(%rsp) # save negateres + +# /* p_negateres now means the following. +# 7FC00000 means x<0, y not an integer, return NaN. +# 80000000 means x<0, y is odd integer, so set the sign bit. +## 0 means even integer, and/or x>=0. +# */ + + +# **** Here starts the main calculations **** +# The algorithm used is x**y = exp(y*log(x)) +# Extra precision is required in intermediate steps to meet the 1ulp requirement +# +# log(x) calculation + call __vrd4_log@PLT # get the double precision log value + # for all four x's +# y* logx +# convert all four y's to double +# mov p_yptr(%rsp),%rdi ; get y_array pointer + cvtps2pd (%rdi),%xmm2 + cvtps2pd 8(%rdi),%xmm3 + +# /* just multiply by y */ + mulpd %xmm2,%xmm0 + mulpd %xmm3,%xmm1 + +# /* The following code computes r = exp(w) */ + call __vrd4_exp@PLT # get the double exp value + # for all four y*log(x)'s + mov p_xptr(%rsp),%rsi # get x_array pointer + mov p_yptr(%rsp),%rdi # get y_array pointer +# +# convert all four results to double + cvtpd2ps %xmm0,%xmm0 + cvtpd2ps %xmm1,%xmm1 + movlhps %xmm1,%xmm0 + +# perform special case and error checking on input values + +# special case checking is done first in the scalar version since +# it allows for early fast returns. But for vectors, we consider them +# to be rare, so early returns are not necessary. So we first compute +# the x**y values, and then check for special cases. + +# we do some of the checking in reverse order of the scalar version. +# apply the negate result flags + orps p_negateres(%rsp),%xmm0 # get negateres + +## if y is infinite or so large that the result would overflow or underflow + movdqa p_ay(%rsp),%xmm4 + cmpps $5,.L__mask_ly(%rip),%xmm4 # y not less than large value, ffs if so. + movmskps %xmm4,%edx + test $0x0f,%edx + jnz .Ly_large +.Lrnsx3: + +## if x is infinite + movdqa p_ax(%rsp),%xmm4 + cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so. + movmskps %xmm4,%edx + test $0x0f,%edx + jnz .Lx_infinite +.Lrnsx1: +## if x is zero + xorps %xmm4,%xmm4 + cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so. + movmskps %xmm4,%edx + test $0x0f,%edx + jnz .Lx_zero +.Lrnsx2: +## if y is NAN + movdqu (%rdi),%xmm4 # get y + cmpps $4,%xmm4,%xmm4 # a compare not equal of y to itself should + # be false, unless y is a NaN. ff's if NaN. + movmskps %xmm4,%ecx + test $0x0f,%ecx + jnz .Ly_NaN +.Lrnsx4: +## if x is NAN + movdqu (%rsi),%xmm4 # get x + cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should + # be false, unless x is a NaN. ff's if NaN. + movmskps %xmm4,%ecx + test $0x0f,%ecx + jnz .Lx_NaN +.Lrnsx5: + +## if |y| == 0 then return 1 + movdqa .L__float_one(%rip),%xmm3 # one + xorps %xmm2,%xmm2 + cmpps $4,p_ay(%rsp),%xmm2 # not equal to 0.0?, ffs if not equal. + andps %xmm2,%xmm0 # keep the others + andnps %xmm3,%xmm2 # mask for ones + orps %xmm2,%xmm0 +## if x == +1, return +1 for all x + movdqa %xmm3,%xmm2 + movdqu (%rsi),%xmm5 + cmpps $4,%xmm5,%xmm2 # not equal to +1.0?, ffs if not equal. 
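+# The andps/andnps/orps triple that follows is the SSE lane-select
+# idiom used throughout these files; in C terms (sketch):
+#   result = (mask & a) | (~mask & b);
+# i.e. lanes whose mask bits are all ones take a and the rest keep b --
+# here selecting 1.0 into the lanes where x == +1.0.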
+	andps	%xmm2,%xmm0	# keep the others
+	andnps	%xmm3,%xmm2	# mask for ones
+	orps	%xmm2,%xmm0
+
+.L__powf_cleanup2:
+
+# update the x and y pointers
+	add		$16,%rdi
+	add		$16,%rsi
+	mov		%rsi,p_xptr(%rsp)	# save x_array pointer
+	mov		%rdi,p_yptr(%rsp)	# save y_array pointer
+# store the result _m128d
+	mov		p_zptr(%rsp),%rdi	# get z_array pointer
+	movups	%xmm0,(%rdi)
+#	prefetchw	QWORD PTR [rdi+64]
+	prefetch	64(%rdi)
+	add		$16,%rdi
+	mov		%rdi,p_zptr(%rsp)	# save z_array pointer
+
+
+	mov		p_iter(%rsp),%rax	# get number of iterations
+	sub		$1,%rax
+	mov		%rax,p_iter(%rsp)	# save number of iterations
+	jnz		.L__vsa_top
+
+
+# see if we need to do any extras
+	mov		p_nv(%rsp),%rax	# get number of values
+	test	%rax,%rax
+	jnz		.L__vsa_cleanup
+
+.L__final_check:
+	mov		save_rbx(%rsp),%rbx	# restore rbx
+	add		$stack_size,%rsp
+	ret
+
+	.align 16
+# we jump here when we have an odd number of calls to make at the
+# end
+.L__vsa_cleanup:
+	mov		p_nv(%rsp),%rax	# get number of values
+	test	%rax,%rax	# are there any values
+	jz		.L__final_check	# exit if not
+
+	mov		p_xptr(%rsp),%rsi
+	mov		p_yptr(%rsp),%rdi
+
+# fill in a m128 with zeroes and the extra values and then make a recursive call.
+	xorps	%xmm0,%xmm0
+	movaps	%xmm0,p2_temp(%rsp)
+	movaps	%xmm0,p2_temp+16(%rsp)
+
+	mov		(%rsi),%ecx	# we know there's at least one
+	mov		%ecx,p2_temp(%rsp)
+	mov		(%rdi),%edx	# we know there's at least one
+	mov		%edx,p2_temp+16(%rsp)
+	cmp		$2,%rax
+	jl		.L__vsacg
+
+	mov		4(%rsi),%ecx	# do the second value
+	mov		%ecx,p2_temp+4(%rsp)
+	mov		4(%rdi),%edx	# we know there's at least one
+	mov		%edx,p2_temp+20(%rsp)
+	cmp		$3,%rax
+	jl		.L__vsacg
+
+	mov		8(%rsi),%ecx	# do the third value
+	mov		%ecx,p2_temp+8(%rsp)
+	mov		8(%rdi),%edx	# we know there's at least one
+	mov		%edx,p2_temp+24(%rsp)
+
+.L__vsacg:
+	mov		$4,%rdi	# parameter for N
+	lea		p2_temp(%rsp),%rsi	# &x parameter
+	lea		p2_temp+16(%rsp),%rdx	# &y parameter
+	lea		p2_temp1(%rsp),%rcx	# &z parameter
+	call	vrsa_powf@PLT	# call recursively to compute four values
+
+# now copy the results to the destination array
+	mov		p_zptr(%rsp),%rdi
+	mov		p_nv(%rsp),%rax	# get number of values
+	mov		p2_temp1(%rsp),%ecx
+	mov		%ecx,(%rdi)	# we know there's at least one
+	cmp		$2,%rax
+	jl		.L__vsacgf
+
+	mov		p2_temp1+4(%rsp),%ecx
+	mov		%ecx,4(%rdi)	# do the second value
+	cmp		$3,%rax
+	jl		.L__vsacgf
+
+	mov		p2_temp1+8(%rsp),%ecx
+	mov		%ecx,8(%rdi)	# do the third value
+
+.L__vsacgf:
+	jmp		.L__final_check
+
+	.align 16
+# y is a NaN.
+.Ly_NaN:
+	mov		p_yptr(%rsp),%rdx	# get pointer to y
+	movdqu	(%rdx),%xmm4	# get y
+	movdqa	%xmm4,%xmm3
+	movdqa	%xmm4,%xmm5
+	movdqa	.L__mask_sigbit(%rip),%xmm2	# get the signalling bits
+	cmpps	$0,%xmm4,%xmm4	# a compare equal of y to itself should
+						# be true, unless y is a NaN. 0's if NaN.
+	cmpps	$4,%xmm3,%xmm3	# compare not equal, ff's if NaN.
+	andps	%xmm4,%xmm0	# keep the other results
+	andps	%xmm3,%xmm2	# get just the right signalling bits
+	andps	%xmm5,%xmm3	# mask for the NaNs
+	orps	%xmm2,%xmm3	# convert to QNaNs
+	orps	%xmm3,%xmm0	# combine
+	jmp		.Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	movdqu	(%rcx),%xmm4	# get x
+	movdqa	%xmm4,%xmm3
+	movdqa	%xmm4,%xmm5
+	movdqa	.L__mask_sigbit(%rip),%xmm2	# get the signalling bits
+	cmpps	$0,%xmm4,%xmm4	# a compare equal of x to itself should
+						# be true, unless x is a NaN. 0's if NaN.
+	cmpps	$4,%xmm3,%xmm3	# compare not equal, ff's if NaN.
+	andps	%xmm4,%xmm0	# keep the other results
+	andps	%xmm3,%xmm2	# get just the right signalling bits
+	andps	%xmm5,%xmm3	# mask for the NaNs
+	orps	%xmm2,%xmm3	# convert to QNaNs
+	orps	%xmm3,%xmm0	# combine
+	jmp		.Lrnsx5
+
+# y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+	movdqa	%xmm0,p_temp(%rsp)
+
+	test	$1,%edx
+	jz		.Lylrga
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		(%rcx),%eax
+	mov		(%rbx),%ebx
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special6	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp(%rsp)
+.Lylrga:
+	test	$2,%edx
+	jz		.Lylrgb
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		4(%rcx),%eax
+	mov		4(%rbx),%ebx
+	mov		p_inty+4(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special6	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+4(%rsp)
+.Lylrgb:
+	test	$4,%edx
+	jz		.Lylrgc
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		8(%rcx),%eax
+	mov		8(%rbx),%ebx
+	mov		p_inty+8(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special6	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+8(%rsp)
+.Lylrgc:
+	test	$8,%edx
+	jz		.Lylrgd
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		12(%rcx),%eax
+	mov		12(%rbx),%ebx
+	mov		p_inty+12(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special6	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+12(%rsp)
+.Lylrgd:
+	movdqa	p_temp(%rsp),%xmm0
+	jmp		.Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in %eax, y in %ebx.
+# returns result in eax
+.Lnp_special6:
+# handle |x|==1 cases first
+	mov		$0x07FFFFFFF,%r8d
+	and		%eax,%r8d
+	cmp		$0x03f800000,%r8d	# jump if |x| != 1
+	jnz		.Lnps6
+	mov		$0x03f800000,%eax	# return 1 for all |x|==1
+	jmp		.Lnpx64
+
+# cases where |x| != 1
+.Lnps6:
+	mov		$0x07f800000,%ecx
+	xor		%eax,%eax	# assume 0 return
+	test	$0x080000000,%ebx
+	jnz		.Lnps62	# jump if y negative
+# y = +inf
+	cmp		$0x03f800000,%r8d
+	cmovg	%ecx,%eax	# return inf if |x| > 1
+	jmp		.Lnpx64
+.Lnps62:
+# y = -inf
+	cmp		$0x03f800000,%r8d
+	cmovl	%ecx,%eax	# return inf if |x| < 1
+	jmp		.Lnpx64
+
+.Lnpx64:
+	ret
+
+# handle cases where x is +/- infinity.  edx is the mask
+	.align 16
+.Lx_infinite:
+	movdqa	%xmm0,p_temp(%rsp)
+
+	test	$1,%edx
+	jz		.Lxinfa
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		(%rcx),%eax
+	mov		(%rbx),%ebx
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x1	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp(%rsp)
+.Lxinfa:
+	test	$2,%edx
+	jz		.Lxinfb
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		4(%rcx),%eax
+	mov		4(%rbx),%ebx
+	mov		p_inty+4(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x1	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+4(%rsp)
+.Lxinfb:
+	test	$4,%edx
+	jz		.Lxinfc
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		8(%rcx),%eax
+	mov		8(%rbx),%ebx
+	mov		p_inty+8(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x1	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+8(%rsp)
+.Lxinfc:
+	test	$8,%edx
+	jz		.Lxinfd
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		12(%rcx),%eax
+	mov		12(%rbx),%ebx
+	mov		p_inty+12(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x1	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+12(%rsp)
+.Lxinfd:
+	movdqa	p_temp(%rsp),%xmm0
+	jmp		.Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+.Lnp_special_x1:	# x is infinite
+	test	$0x080000000,%eax	# is x positive
+	jnz		.Lnsx11	# jump if not
+	test	$0x080000000,%ebx	# is y positive
+	jz		.Lnsx13	# just return if so
+	xor		%eax,%eax	# else return 0
+	jmp		.Lnsx13
+
+.Lnsx11:
+	cmp		$1,%ecx	# if inty ==1
+	jnz		.Lnsx12	# jump if not
+	test	$0x080000000,%ebx	# is y positive
+	jz		.Lnsx13	# just return if so
+	mov		$0x080000000,%eax	# else return -0
+	jmp		.Lnsx13
+.Lnsx12:	# inty <> 1
+	and		$0x07FFFFFFF,%eax	# |x| (+inf), returned if y >= 0
+	test	$0x080000000,%ebx	# is y positive
+	jz		.Lnsx13	# just return if so
+	xor		%eax,%eax	# else return 0 for y < 0
+.Lnsx13:
+	ret
+
+
+# handle cases where x is +/- zero.  edx is the mask of x,y pairs with |x|=0
+	.align 16
+.Lx_zero:
+	movdqa	%xmm0,p_temp(%rsp)
+
+	test	$1,%edx
+	jz		.Lxzera
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		(%rcx),%eax
+	mov		(%rbx),%ebx
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x2	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp(%rsp)
+.Lxzera:
+	test	$2,%edx
+	jz		.Lxzerb
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		4(%rcx),%eax
+	mov		4(%rbx),%ebx
+	mov		p_inty+4(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x2	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+4(%rsp)
+.Lxzerb:
+	test	$4,%edx
+	jz		.Lxzerc
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		8(%rcx),%eax
+	mov		8(%rbx),%ebx
+	mov		p_inty+8(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x2	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+8(%rsp)
+.Lxzerc:
+	test	$8,%edx
+	jz		.Lxzerd
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_yptr(%rsp),%rbx	# get pointer to y
+	mov		12(%rcx),%eax
+	mov		12(%rbx),%ebx
+	mov		p_inty+12(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x2	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+12(%rsp)
+.Lxzerd:
+	movdqa	p_temp(%rsp),%xmm0
+	jmp		.Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+	.align 16
+.Lnp_special_x2:
+	cmp		$1,%ecx	# if inty ==1
+	jz		.Lnsx21	# jump if so
+# handle cases of x=+/-0, y not an odd integer
+	xor		%eax,%eax	# return +0 if y positive
+	mov		$0x07f800000,%ecx
+	test	$0x080000000,%ebx	# is y positive
+	cmovnz	%ecx,%eax	# else return +infinity
+	jmp		.Lnsx23
+# y is an odd integer
+.Lnsx21:
+	xor		%r8d,%r8d
+	mov		$0x07f800000,%ecx
+	test	$0x080000000,%ebx	# is y positive
+	cmovnz	%ecx,%r8d	# set to infinity if not
+	and		$0x080000000,%eax	# pick up the sign of x
+	or		%r8d,%eax	# and include it in the result
+.Lnsx23:
+	ret
+
+
+
+	.data
+	.align 64
+
+.L__mask_sign:		.quad 0x08000000080000000	# a sign bit mask
+					.quad 0x08000000080000000
+
+.L__mask_nsign:		.quad 0x07FFFFFFF7FFFFFFF	# a not sign bit mask
+					.quad 0x07FFFFFFF7FFFFFFF
+
+# used by inty
+.L__mask_127:		.quad 0x00000007F0000007F	# EXPBIAS_SP32
+					.quad 0x00000007F0000007F
+
+.L__mask_mant:		.quad 0x0007FFFFF007FFFFF	# mantissa bit mask
+					.quad 0x0007FFFFF007FFFFF
+
+.L__mask_1:			.quad 0x00000000100000001	# 1
+					.quad 0x00000000100000001
+
+.L__mask_2:			.quad 0x00000000200000002	# 2
+					.quad 0x00000000200000002
+
+.L__mask_24:		.quad 0x00000001800000018	# 24
+					.quad 0x00000001800000018
+
+.L__mask_23:		.quad 0x00000001700000017	# 23
+					.quad 0x00000001700000017
+
+# used by special case checking
+
+.L__float_one:		.quad 0x03f8000003f800000	# one
+					.quad 0x03f8000003f800000
+
+.L__mask_inf:		.quad 0x07f8000007F800000	# infinity
+					.quad 0x07f8000007F800000
+
+.L__mask_NaN:		.quad 0x07fC000007FC00000	# NaN
+					.quad 0x07fC000007FC00000
+
+.L__mask_sigbit:	.quad 0x00040000000400000	# QNaN bit
+					.quad 0x00040000000400000
+
+.L__mask_ly:		.quad 0x04f0000004f000000	# large y
+					.quad 0x04f0000004f000000
+
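+
+# For reference, the y classification implemented by the vector code above
+# can be written in C roughly as follows.  This is a sketch only: the helper
+# name classify_y is illustrative, and |y| == 0 is assumed to be handled
+# separately, as the special case checks do.
+#
+#   #include <string.h>
+#   static int classify_y(float y)   /* 0: not integer, 1: odd, 2: even */
+#   {
+#       unsigned uy;
+#       memcpy(&uy, &y, sizeof uy);
+#       int yexp = (int)((uy & 0x7fffffff) >> 23) - 127 + 1;
+#       if (yexp < 1)  return 0;               /* |y| < 1.0: not an integer */
+#       if (yexp > 24) return 2;               /* 2^0 bit shifted out: even */
+#       unsigned mask = (1u << (24 - yexp)) - 1;
+#       if (uy & mask) return 0;               /* fractional bits remain    */
+#       return (((uy & ~mask) >> (24 - yexp)) & 1) ? 1 : 2;
+#   }
+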
diff --git a/src/gas/vrsapowxf.S b/src/gas/vrsapowxf.S new file mode 100644 index 0000000..4f67daf --- /dev/null +++ b/src/gas/vrsapowxf.S
@@ -0,0 +1,753 @@ + +# +# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +# +# This file is part of libacml_mv. +# +# libacml_mv is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# libacml_mv is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with libacml_mv. If not, see +# <http://www.gnu.org/licenses/>. +# +# + + + + + +# +# vrsapowxf.asm +# +# An array implementation of the powf libm function. +# This routine raises the x array to a constant y power. +# +# Prototype: +# +# void vrsa_powxf(int n, float *x, float y, float *z); +# +# Places the results into the supplied z array. +# Does not perform error handling, but does return C99 values for error +# inputs. Denormal results are truncated to 0. +# +# + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + + +# define local variable storage offsets +.equ p_temp,0x00 # xmmword +.equ p_negateres,0x10 # qword + +.equ p_xexp,0x20 # qword + +.equ save_rbx,0x030 #qword + +.equ p_y,0x048 # y value + +.equ p_ax,0x050 # absolute x +.equ p_sx,0x060 # sign of x's + +.equ p_ay,0x070 # absolute y +.equ p_yexp,0x080 # unbiased exponent of y + +.equ p_inty,0x090 # integer y indicator + +.equ p_xptr,0x0a0 # ptr to x values +.equ p_zptr,0x0b0 # ptr to z values + +.equ p_nv,0x0b8 #qword +.equ p_iter,0x0c0 # qword storage for number of loop iterations + +.equ p2_temp,0x0d0 #qword +.equ p2_temp1,0x0f0 #qword + +.equ stack_size,0x0118 # allocate 40h more than + # we need to avoid bank conflicts + + + + + .weak vrsa_powxf_ + .set vrsa_powxf_,__vrsa_powxf__ + .weak vrsa_powxf__ + .set vrsa_powxf__,__vrsa_powxf__ + + .text + .align 16 + .p2align 4,,15 +.globl __vrsa_powxf__ + .type __vrsa_powxf__,@function +__vrsa_powxf__: + +#/* a FORTRAN subroutine implementation of array powf +#** VRSA_POWXF(N,X,Y,Z) +#** C equivalent +#*/ +#void vrsa_powxf_(int * n, float *x, float *y, float *z) +#{ +# vrsa_powxf(*n,x,y,z); +#} +# parameters are passed in by Linux FORTRAN as: +# edi - int n +# rsi - float *x +# rdx - float *y +# rcx - float *z + mov (%rdi),%edi + movss (%rdx),%xmm0 + mov %rcx,%rdx + + + + +# parameters are passed in by Linux C as: +# edi - int n +# rsi - float *x +# xmm0 - float y +# rdx - float *z + +.globl vrsa_powxf + .type vrsa_powxf,@function +vrsa_powxf: + + sub $stack_size,%rsp + mov %rbx,save_rbx(%rsp) # save rbx + + movss %xmm0,p_y(%rsp) # save y + mov %rsi,p_xptr(%rsp) # save pointer to x + mov %rdx,p_zptr(%rsp) # save pointer to z +#ifdef INTEGER64 + mov %rdi,%rax +#else + mov %edi,%eax +#endif + test %rax,%rax # just return if count is zero + jz .L__final_check # exit if not + + mov %rax,%rcx + mov %rcx,p_nv(%rsp) # save number of values + +# +# classify y +# vector 32 bit integer method +# /* See whether y is an integer. +# inty = 0 means not an integer. +# inty = 1 means odd integer. +# inty = 2 means even integer. +# */ +# movdqa .LXMMWORD(%rip),%xmm4 PTR [rdx] +# get yexp + mov p_y(%rsp),%r8d # r8 is uy + mov $0x07fffffff,%r9d + and %r8d,%r9d # r9 is ay + +## if |y| == 0 then return 1 + cmp $0,%r9d # is y a zero? 
+ jz .Ly_zero + + mov $0x07f800000,%eax # EXPBITS_SP32 + and %r9d,%eax # y exp + + xor %edi,%edi + shr $23,%eax #>> EXPSHIFTBITS_SP32 + sub $126,%eax # - EXPBIAS_SP32 + 1 - eax is now the unbiased exponent + mov $1,%ebx + cmp %ebx,%eax # if (yexp < 1) + cmovl %edi,%ebx + jl .Lsave_inty + + mov $24,%ecx + cmp %ecx,%eax # if (yexp >24) + jle .Lcly1 + mov $2,%ebx + jmp .Lsave_inty +.Lcly1: # else 1<=yexp<=24 + sub %eax,%ecx # build mask for mantissa + shl %cl,%ebx + dec %ebx # rbx = mask = (1 << (24 - yexp)) - 1 + + mov %r8d,%eax + and %ebx,%eax # if ((uy & mask) != 0) + cmovnz %edi,%ebx # inty = 0; + jnz .Lsave_inty + + not %ebx # else if (((uy & ~mask) >> (24 - yexp)) & 0x00000001) + mov %r8d,%eax + and %ebx,%eax + shr %cl,%eax + inc %edi + and %edi,%eax + mov %edi,%ebx # inty = 1 + jnz .Lsave_inty + inc %ebx # else inty = 2 + + +.Lsave_inty: + mov %r8d,p_y+4(%rsp) # save an extra copy of y + mov %ebx,p_inty(%rsp) # save inty + + mov p_nv(%rsp),%rax # get number of values + mov %rax,%rcx +# see if too few values to call the main loop + shr $2,%rax # get number of iterations + jz .L__vsa_cleanup # jump if only single calls +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rcx # compute number of extra single calls + mov %rcx,p_nv(%rsp) # save number of left over values + +# process the array 4 values at a time. + +.L__vsa_top: +# build the input _m128 +# first get x + mov p_xptr(%rsp),%rsi # get x_array pointer + movups (%rsi),%xmm0 + prefetch 64(%rsi) + + + movaps %xmm0,%xmm2 + andps .L__mask_nsign(%rip),%xmm0 # get abs x + andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits + movaps %xmm0,p_ax(%rsp) # save them + movaps %xmm2,p_sx(%rsp) # save them +# convert all four x's to double + cvtps2pd p_ax(%rsp),%xmm0 + cvtps2pd p_ax+8(%rsp),%xmm1 +# +# do x special case checking +# +# movdqa %xmm4,%xmm5 +# pcmpeqd %xmm3,%xmm5 ; is y not an integer? ff's if so +# pand .LXMMWORD(%rip),%xmm5 PTR __mask_NaN ; these values will be NaNs, if x<0 + pxor %xmm3,%xmm3 + xor %eax,%eax + mov $0x07FC00000,%ecx + cmp $0,%ebx # is y not an integer? + cmovz %ecx,%eax # then set to return a NaN. else 0. + mov $0x080000000,%ecx + cmp $1,%ebx # is y an odd integer? + cmovz %ecx,%eax # maybe set sign bit if so + movd %eax,%xmm5 + pshufd $0,%xmm5,%xmm5 +# shufps xmm5,%xmm5 +# movdqa %xmm4,%xmm2 +# pcmpeqd .LXMMWORD(%rip),%xmm2 PTR __mask_1 ; is it odd? ff's if so +# pand .LXMMWORD(%rip),%xmm2 PTR __mask_sign ; these values might get their sign bit set +# por %xmm2,%xmm5 + +# cmpps xmm3,XMMWORD PTR p_sx[rsp],0 ; if the signs are set + pcmpeqd p_sx(%rsp),%xmm3 # if the signs are set + pandn %xmm5,%xmm3 # then negateres gets the values as shown below + movdqa %xmm3,p_negateres(%rsp) # save negateres + +# /* p_negateres now means the following. +# 7FC00000 means x<0, y not an integer, return NaN. +# 80000000 means x<0, y is odd integer, so set the sign bit. +## 0 means even integer, and/or x>=0. 
+#  */
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
+# log(x) calculation
+	call	__vrd4_log@PLT	# get the double precision log value
+						# for all four x's
+# y * log(x)
+	cvtps2pd	p_y(%rsp),%xmm2	# convert the two packed single y's to double
+
+#  /* just multiply by y */
+	mulpd	%xmm2,%xmm0
+	mulpd	%xmm2,%xmm1
+
+#  /* The following code computes r = exp(w) */
+	call	__vrd4_exp@PLT	# get the double exp value
+						# for all four y*log(x)'s
+	mov		p_xptr(%rsp),%rsi	# get x_array pointer
+
+#
+# convert all four results back to single precision
+	cvtpd2ps	%xmm0,%xmm0
+	cvtpd2ps	%xmm1,%xmm1
+	movlhps	%xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns.  But for vectors, we consider them
+# to be rare, so early returns are not necessary.  So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+# apply the negate result flags
+	orps	p_negateres(%rsp),%xmm0	# get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+	mov		p_y(%rsp),%edx	# get y
+	and		$0x07fffffff,%edx	# develop ay
+#	mov		$0x04f000000,%eax
+	cmp		$0x04f000000,%edx
+	ja		.Ly_large
+.Lrnsx3:
+
+## if x is infinite
+	movdqa	p_ax(%rsp),%xmm4
+	cmpps	$0,.L__mask_inf(%rip),%xmm4	# equal to infinity, ffs if so.
+	movmskps	%xmm4,%edx
+	test	$0x0f,%edx
+	jnz		.Lx_infinite
+.Lrnsx1:
+## if x is zero
+	xorps	%xmm4,%xmm4
+	cmpps	$0,p_ax(%rsp),%xmm4	# equal to zero, ffs if so.
+	movmskps	%xmm4,%edx
+	test	$0x0f,%edx
+	jnz		.Lx_zero
+.Lrnsx2:
+## if y is NAN
+	movss	p_y(%rsp),%xmm4	# get y
+	ucomiss	%xmm4,%xmm4	# comparing y to itself should
+						# be true, unless y is a NaN. parity flag if NaN.
+	jp		.Ly_NaN
+.Lrnsx4:
+## if x is NAN
+	movdqa	p_ax(%rsp),%xmm4	# get x
+	cmpps	$4,%xmm4,%xmm4	# a compare not equal of x to itself should
+						# be false, unless x is a NaN. ff's if NaN.
+	movmskps	%xmm4,%ecx
+	test	$0x0f,%ecx
+	jnz		.Lx_NaN
+.Lrnsx5:
+
+## if x == +1, return +1 for all x
+	movdqa	.L__float_one(%rip),%xmm3	# one
+	mov		p_xptr(%rsp),%rdx	# get pointer to x
+	movdqa	%xmm3,%xmm2
+	movdqu	(%rdx),%xmm5
+	cmpps	$4,%xmm5,%xmm2	# not equal to +1.0?, ffs if not equal.
+	andps	%xmm2,%xmm0	# keep the others
+	andnps	%xmm3,%xmm2	# mask for ones
+	orps	%xmm2,%xmm0
+
+.L__vsa_bottom:
+
+# update the x and y pointers
+	add		$16,%rsi
+	mov		%rsi,p_xptr(%rsp)	# save x_array pointer
+# store the result _m128d
+	mov		p_zptr(%rsp),%rdi	# get z_array pointer
+	movups	%xmm0,(%rdi)
+#	prefetchw	QWORD PTR [rdi+64]
+	prefetch	64(%rdi)
+	add		$16,%rdi
+	mov		%rdi,p_zptr(%rsp)	# save z_array pointer
+
+
+	mov		p_iter(%rsp),%rax	# get number of iterations
+	sub		$1,%rax
+	mov		%rax,p_iter(%rsp)	# save number of iterations
+	jnz		.L__vsa_top
+
+
+# see if we need to do any extras
+	mov		p_nv(%rsp),%rax	# get number of values
+	test	%rax,%rax
+	jnz		.L__vsa_cleanup
+
+.L__final_check:
+
+	mov		save_rbx(%rsp),%rbx	# restore rbx
+	add		$stack_size,%rsp
+	ret
+
+	.align 16
+# we jump here when we have an odd number of calls to make at the
+# end
+.L__vsa_cleanup:
+	mov		p_nv(%rsp),%rax	# get number of values
+
+	mov		p_xptr(%rsp),%rsi
+	mov		p_y(%rsp),%r8d	# r8 is uy
+
+# fill in a m128 with zeroes and the extra values and then make a recursive call.
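+# In C terms, this cleanup is roughly the following sketch ('nleft' stands
+# for the leftover count kept in p_nv, and 'y' for the constant power):
+#
+#   float xtmp[4] = {0.0f, 0.0f, 0.0f, 0.0f}, ztmp[4];
+#   for (int i = 0; i < nleft; i++) xtmp[i] = x[i];
+#   vrsa_powxf(4, xtmp, y, ztmp);      /* recurse on a zero-padded vector */
+#   for (int i = 0; i < nleft; i++) z[i] = ztmp[i];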
+	xorps	%xmm0,%xmm0
+	movaps	%xmm0,p2_temp(%rsp)
+	movaps	%xmm0,p2_temp+16(%rsp)
+
+	mov		(%rsi),%ecx	# we know there's at least one
+	mov		%ecx,p2_temp(%rsp)
+	mov		%r8d,p2_temp+16(%rsp)
+	cmp		$2,%rax
+	jl		.L__vsacg
+
+	mov		4(%rsi),%ecx	# do the second value
+	mov		%ecx,p2_temp+4(%rsp)
+	mov		%r8d,p2_temp+20(%rsp)
+	cmp		$3,%rax
+	jl		.L__vsacg
+
+	mov		8(%rsi),%ecx	# do the third value
+	mov		%ecx,p2_temp+8(%rsp)
+	mov		%r8d,p2_temp+24(%rsp)
+
+.L__vsacg:
+	mov		$4,%rdi	# parameter for N
+	lea		p2_temp(%rsp),%rsi	# &x parameter
+	movaps	p2_temp+16(%rsp),%xmm0	# y parameter
+	lea		p2_temp1(%rsp),%rdx	# &z parameter
+	call	vrsa_powxf@PLT	# call recursively to compute four values
+
+# now copy the results to the destination array
+	mov		p_zptr(%rsp),%rdi
+	mov		p_nv(%rsp),%rax	# get number of values
+	mov		p2_temp1(%rsp),%ecx
+	mov		%ecx,(%rdi)	# we know there's at least one
+	cmp		$2,%rax
+	jl		.L__vsacgf
+
+	mov		p2_temp1+4(%rsp),%ecx
+	mov		%ecx,4(%rdi)	# do the second value
+	cmp		$3,%rax
+	jl		.L__vsacgf
+
+	mov		p2_temp1+8(%rsp),%ecx
+	mov		%ecx,8(%rdi)	# do the third value
+
+.L__vsacgf:
+	jmp		.L__final_check
+
+
+	.align 16
+.Ly_zero:
+## if |y| == 0 then return 1
+	mov		$0x03f800000,%ecx	# one
+# fill all results with a one
+	mov		p_zptr(%rsp),%r9	# &z parameter
+	mov		p_nv(%rsp),%rax	# get number of values
+.L__yzt:
+	mov		%ecx,(%r9)	# store a 1
+	add		$4,%r9
+	sub		$1,%rax
+	test	%rax,%rax
+	jnz		.L__yzt
+	jmp		.L__final_check
+# y is a NaN.
+.Ly_NaN:
+	mov		p_y(%rsp),%r8d
+	or		$0x000400000,%r8d	# convert to QNaNs
+	movd	%r8d,%xmm0	# propagate to all results
+	shufps	$0,%xmm0,%xmm0
+	jmp		.Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	movdqu	(%rcx),%xmm4	# get x
+	movdqa	%xmm4,%xmm3
+	movdqa	%xmm4,%xmm5
+	movdqa	.L__mask_sigbit(%rip),%xmm2	# get the signalling bits
+	cmpps	$0,%xmm4,%xmm4	# a compare equal of x to itself should
+						# be true, unless x is a NaN. 0's if NaN.
+	cmpps	$4,%xmm3,%xmm3	# compare not equal, ff's if NaN.
+	andps	%xmm4,%xmm0	# keep the other results
+	andps	%xmm3,%xmm2	# get just the right signalling bits
+	andps	%xmm5,%xmm3	# mask for the NaNs
+	orps	%xmm2,%xmm3	# convert to QNaNs
+	orps	%xmm3,%xmm0	# combine
+	jmp		.Lrnsx5
+
+# y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+	movdqa	%xmm0,p_temp(%rsp)
+
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		(%rcx),%eax
+	mov		p_y(%rsp),%ebx
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special6	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp(%rsp)
+
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		4(%rcx),%eax
+	mov		p_y(%rsp),%ebx
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special6	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+4(%rsp)
+
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		8(%rcx),%eax
+	mov		p_y(%rsp),%ebx
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special6	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+8(%rsp)
+
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		12(%rcx),%eax
+	mov		p_y(%rsp),%ebx
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special6	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+12(%rsp)
+
+	movdqa	p_temp(%rsp),%xmm0
+	jmp		.Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in %eax, y in %ebx.
+# returns result in eax
+.Lnp_special6:
+# handle |x|==1 cases first
+	mov		$0x07FFFFFFF,%r8d
+	and		%eax,%r8d
+	cmp		$0x03f800000,%r8d	# jump if |x| != 1
+	jnz		.Lnps6
+	mov		$0x03f800000,%eax	# return 1 for all |x|==1
+	jmp		.Lnpx64
+
+# cases where |x| != 1
+.Lnps6:
+	mov		$0x07f800000,%ecx
+	xor		%eax,%eax	# assume 0 return
+	test	$0x080000000,%ebx
+	jnz		.Lnps62	# jump if y negative
+# y = +inf
+	cmp		$0x03f800000,%r8d
+	cmovg	%ecx,%eax	# return inf if |x| > 1
+	jmp		.Lnpx64
+.Lnps62:
+# y = -inf
+	cmp		$0x03f800000,%r8d
+	cmovl	%ecx,%eax	# return inf if |x| < 1
+	jmp		.Lnpx64
+
+.Lnpx64:
+	ret
+
+# handle cases where x is +/- infinity.  edx is the mask
+	.align 16
+.Lx_infinite:
+	movdqa	%xmm0,p_temp(%rsp)
+
+	test	$1,%edx
+	jz		.Lxinfa
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		(%rcx),%eax
+	mov		p_y(%rsp),%ebx
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x1	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp(%rsp)
+.Lxinfa:
+	test	$2,%edx
+	jz		.Lxinfb
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_y(%rsp),%ebx
+	mov		4(%rcx),%eax
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x1	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+4(%rsp)
+.Lxinfb:
+	test	$4,%edx
+	jz		.Lxinfc
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_y(%rsp),%ebx
+	mov		8(%rcx),%eax
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x1	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+8(%rsp)
+.Lxinfc:
+	test	$8,%edx
+	jz		.Lxinfd
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_y(%rsp),%ebx
+	mov		12(%rcx),%eax
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x1	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+12(%rsp)
+.Lxinfd:
+	movdqa	p_temp(%rsp),%xmm0
+	jmp		.Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+.Lnp_special_x1:	# x is infinite
+	test	$0x080000000,%eax	# is x positive
+	jnz		.Lnsx11	# jump if not
+	test	$0x080000000,%ebx	# is y positive
+	jz		.Lnsx13	# just return if so
+	xor		%eax,%eax	# else return 0
+	jmp		.Lnsx13
+
+.Lnsx11:
+	cmp		$1,%ecx	# if inty ==1
+	jnz		.Lnsx12	# jump if not
+	test	$0x080000000,%ebx	# is y positive
+	jz		.Lnsx13	# just return if so
+	mov		$0x080000000,%eax	# else return -0
+	jmp		.Lnsx13
+.Lnsx12:	# inty <> 1
+	and		$0x07FFFFFFF,%eax	# |x| (+inf), returned if y >= 0
+	test	$0x080000000,%ebx	# is y positive
+	jz		.Lnsx13	# just return if so
+	xor		%eax,%eax	# else return 0 for y < 0
+.Lnsx13:
+	ret
+
+
+# handle cases where x is +/- zero.  edx is the mask of x,y pairs with |x|=0
+	.align 16
+.Lx_zero:
+	movdqa	%xmm0,p_temp(%rsp)
+
+	test	$1,%edx
+	jz		.Lxzera
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_y(%rsp),%ebx
+	mov		(%rcx),%eax
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x2	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp(%rsp)
+.Lxzera:
+	test	$2,%edx
+	jz		.Lxzerb
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_y(%rsp),%ebx
+	mov		4(%rcx),%eax
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x2	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+4(%rsp)
+.Lxzerb:
+	test	$4,%edx
+	jz		.Lxzerc
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_y(%rsp),%ebx
+	mov		8(%rcx),%eax
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x2	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+8(%rsp)
+.Lxzerc:
+	test	$8,%edx
+	jz		.Lxzerd
+	mov		p_xptr(%rsp),%rcx	# get pointer to x
+	mov		p_y(%rsp),%ebx
+	mov		12(%rcx),%eax
+	mov		p_inty(%rsp),%ecx
+	sub		$8,%rsp
+	call	.Lnp_special_x2	# call the handler for one value
+	add		$8,%rsp
+	mov		%eax,p_temp+12(%rsp)
+.Lxzerd:
+	movdqa	p_temp(%rsp),%xmm0
+	jmp		.Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+	.align 16
+.Lnp_special_x2:
+	cmp		$1,%ecx	# if inty ==1
+	jz		.Lnsx21	# jump if so
+# handle cases of x=+/-0, y not an odd integer
+	xor		%eax,%eax	# return +0 if y positive
+	mov		$0x07f800000,%ecx
+	test	$0x080000000,%ebx	# is y positive
+	cmovnz	%ecx,%eax	# else return +infinity
+	jmp		.Lnsx23
+# y is an odd integer
+.Lnsx21:
+	xor		%r8d,%r8d
+	mov		$0x07f800000,%ecx
+	test	$0x080000000,%ebx	# is y positive
+	cmovnz	%ecx,%r8d	# set to infinity if not
+	and		$0x080000000,%eax	# pick up the sign of x
+	or		%r8d,%eax	# and include it in the result
+.Lnsx23:
+	ret
+
+	.data
+	.align 64
+
+.L__mask_sign:		.quad 0x08000000080000000	# a sign bit mask
+					.quad 0x08000000080000000
+
+.L__mask_nsign:		.quad 0x07FFFFFFF7FFFFFFF	# a not sign bit mask
+					.quad 0x07FFFFFFF7FFFFFFF
+
+# used by inty
+.L__mask_127:		.quad 0x00000007F0000007F	# EXPBIAS_SP32
+					.quad 0x00000007F0000007F
+
+.L__mask_mant:		.quad 0x0007FFFFF007FFFFF	# mantissa bit mask
+					.quad 0x0007FFFFF007FFFFF
+
+.L__mask_1:			.quad 0x00000000100000001	# 1
+					.quad 0x00000000100000001
+
+.L__mask_2:			.quad 0x00000000200000002	# 2
+					.quad 0x00000000200000002
+
+.L__mask_24:		.quad 0x00000001800000018	# 24
+					.quad 0x00000001800000018
+
+.L__mask_23:		.quad 0x00000001700000017	# 23
+					.quad 0x00000001700000017
+
+# used by special case checking
+
+.L__float_one:		.quad 0x03f8000003f800000	# one
+					.quad 0x03f8000003f800000
+
+.L__mask_inf:		.quad 0x07f8000007F800000	# infinity
+					.quad 0x07f8000007F800000
+
+.L__mask_ninf:		.quad 0x0ff800000fF800000	# -infinity
+					.quad 0x0ff800000fF800000
+
+.L__mask_NaN:		.quad 0x07fC000007FC00000	# NaN
+					.quad 0x07fC000007FC00000
+
+.L__mask_sigbit:	.quad 0x00040000000400000	# QNaN bit
+					.quad 0x00040000000400000
+
+.L__mask_impbit:	.quad 0x00080000000800000	# implicit bit
+					.quad 0x00080000000800000
+
+.L__mask_ly:		.quad 0x04f0000004f000000	# large y
+					.quad 0x04f0000004f000000
+
+
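+# The .Lnp_special_x2 handler above matches the C99 pow special cases for
+# x == +/-0.  A C sketch of the same bit manipulation (helper name
+# illustrative; y == 0 and NaN y are handled before this point):
+#
+#   #include <string.h>
+#   static float pow_x_zero(float x, float y, int inty)
+#   {
+#       unsigned ux, r;
+#       memcpy(&ux, &x, sizeof ux);
+#       r = (y < 0.0f) ? 0x7f800000u : 0u;   /* 0**neg -> inf, 0**pos -> 0 */
+#       if (inty == 1)                       /* odd integer y:             */
+#           r |= ux & 0x80000000u;           /* result keeps x's sign      */
+#       float f;
+#       memcpy(&f, &r, sizeof f);
+#       return f;
+#   }
+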
diff --git a/src/gas/vrsasincosf.S b/src/gas/vrsasincosf.S new file mode 100644 index 0000000..2bb70bf --- /dev/null +++ b/src/gas/vrsasincosf.S
@@ -0,0 +1,2008 @@
+
+#
+#  (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+#  This file is part of libacml_mv.
+#
+#  libacml_mv is free software; you can redistribute it and/or
+#  modify it under the terms of the GNU Lesser General Public
+#  License as published by the Free Software Foundation; either
+#  version 2.1 of the License, or (at your option) any later version.
+#
+#  libacml_mv is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+#  Lesser General Public License for more details.
+#
+#  You should have received a copy of the GNU Lesser General Public
+#  License along with libacml_mv.  If not, see
+#  <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsasincosf.s
+#
+# An array implementation of the sincos libm function.
+#
+# Prototype:
+#
+#    void vrsa_sincosf(int n, float *x, float *ys, float *yc);
+#
+# Computes Sine and Cosine of x for an array of input values.
+# Places the Sine results into the supplied ys array and the Cosine
+# results into the supplied yc array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single precision Sine Cosine values at a time.
+# Four inputs at a time are loaded from the x array, and the four Sine
+# and four Cosine results are returned as packed singles in the supplied
+# ys and yc arrays.
+# Note that computing both Sine and Cosine in one call represents a
+# non-standard ABI usage, as no ABI (and indeed C) currently allows
+# returning 2 values for a function.  It is expected that some compilers
+# may be able to take advantage of this interface when implementing
+# vectorized loops.
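+
+# A minimal usage sketch (illustrative only; array contents assumed):
+#
+#   float x[8], s[8], c[8];
+#   /* ... fill x ... */
+#   vrsa_sincosf(8, x, s, c);   /* s[i] = sinf(x[i]), c[i] = cosf(x[i]) */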
+ +# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 64 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03FA5555555502F31 + .quad 0x0BF56C16BF55699D7 # -0.00138889 c2 + .quad 0x0BF56C16BF55699D7 + .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3 + .quad 0x03EFA015C50A93B49 + .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4 + .quad 0x0BE92524743CC46B8 + +.Lsinarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BFC555555545E87D + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x03F811110DF01232D + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x0BF2A013A88A37196 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x03EC6DBE4AD1572D5 + +.Lsincosarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x0BF56C16BF55699D7 + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x03EFA015C50A93B49 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x0BE92524743CC46B8 + +.Lcossinarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BF56C16BF55699D7 # c2 + .quad 0x03F811110DF01232D + .quad 0x03EFA015C50A93B49 # c3 + .quad 0x0BF2A013A88A37196 + .quad 0x0BE92524743CC46B8 # c4 + .quad 0x03EC6DBE4AD1572D5 + +.align 8 + .Levensin_oddcos_tbl: + + .quad .Lsinsin_sinsin_piby4 # 0 * ; Done + .quad .Lsinsin_sincos_piby4 # 1 + ; Done + .quad .Lsinsin_cossin_piby4 # 2 ; Done + .quad .Lsinsin_coscos_piby4 # 3 + ; Done + + .quad .Lsincos_sinsin_piby4 # 4 ; Done + .quad .Lsincos_sincos_piby4 # 5 * ; Done + .quad .Lsincos_cossin_piby4 # 6 ; Done + .quad 
.Lsincos_coscos_piby4		# 7		; Done
+
+	.quad	.Lcossin_sinsin_piby4		# 8		; Done
+	.quad	.Lcossin_sincos_piby4		# 9		; TBD
+	.quad	.Lcossin_cossin_piby4		# 10 *		; Done
+	.quad	.Lcossin_coscos_piby4		# 11		; Done
+
+	.quad	.Lcoscos_sinsin_piby4		# 12		; Done
+	.quad	.Lcoscos_sincos_piby4		# 13 +		; Done
+	.quad	.Lcoscos_cossin_piby4		# 14		; Done
+	.quad	.Lcoscos_coscos_piby4		# 15 *		; Done
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+	.weak	vrsa_sincosf_
+	.set	vrsa_sincosf_,__vrsa_sincosf__
+	.weak	vrsa_sincosf__
+	.set	vrsa_sincosf__,__vrsa_sincosf__
+
+	.text
+	.align 16
+	.p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array sincos
+#** VRSA_SINCOSF(N,X,YS,YC)
+#** C equivalent
+#*/
+#void vrsa_sincosf__(int *n, float *x, float *ys, float *yc)
+#{
+#	vrsa_sincosf(*n,x,ys,yc);
+#}
+
+.globl __vrsa_sincosf__
+	.type	__vrsa_sincosf__,@function
+__vrsa_sincosf__:
+	mov		(%rdi),%edi
+
+	.align 16
+	.p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ	p_temp,0		# temporary for get/put bits operation
+.equ	p_temp1,0x10	# temporary for get/put bits operation
+
+.equ	save_xmm6,0x20	# temporary for get/put bits operation
+.equ	save_xmm7,0x30	# temporary for get/put bits operation
+.equ	save_xmm8,0x40	# temporary for get/put bits operation
+.equ	save_xmm9,0x50	# temporary for get/put bits operation
+.equ	save_xmm0,0x60	# temporary for get/put bits operation
+.equ	save_xmm11,0x70	# temporary for get/put bits operation
+.equ	save_xmm12,0x80	# temporary for get/put bits operation
+.equ	save_xmm13,0x90	# temporary for get/put bits operation
+.equ	save_xmm14,0x0A0	# temporary for get/put bits operation
+.equ	save_xmm15,0x0B0	# temporary for get/put bits operation
+
+.equ	r,0x0C0			# pointer to r for remainder_piby2
+.equ	rr,0x0D0		# pointer to r for remainder_piby2
+.equ	region,0x0E0	# pointer to r for remainder_piby2
+
+.equ	r1,0x0F0		# pointer to r for remainder_piby2
+.equ	rr1,0x0100		# pointer to r for remainder_piby2
+.equ	region1,0x0110	# pointer to r for remainder_piby2
+
+.equ	p_temp2,0x0120	# temporary for get/put bits operation
+.equ	p_temp3,0x0130	# temporary for get/put bits operation
+
+.equ	p_temp4,0x0140	# temporary for get/put bits operation
+.equ	p_temp5,0x0150	# temporary for get/put bits operation
+
+.equ	p_original,0x0160	# original x
+.equ	p_mask,0x0170		# original x
+.equ	p_sign_sin,0x0180	# original x
+
+.equ	p_original1,0x0190	# original x
+.equ	p_mask1,0x01A0		# original x
+.equ	p_sign1_sin,0x01B0	# original x
+
+
+.equ	save_r12,0x01C0	# temporary for get/put bits operation
+.equ	save_r13,0x01D0	# temporary for get/put bits operation
+
+.equ	p_sin,0x01E0	# sin
+.equ	p_cos,0x01F0	# cos
+
+.equ	save_rdi,0x0200	# temporary for get/put bits operation
+.equ	save_rsi,0x0210	# temporary for get/put bits operation
+
+.equ	p_sign_cos,0x0220	# Sign of lower cos term
+.equ	p_sign1_cos,0x0230	# Sign of upper cos term
+
+.equ	save_xa,0x0240	#qword ; leave space for 4 args*****
+.equ	save_ysa,0x0250	#qword ; leave space for 4 args*****
+.equ	save_yca,0x0260	#qword ; leave space for 4 args*****
+
+.equ	save_nv,0x0270	#qword
+.equ	p_iter,0x0280	#qword	storage for number of loop iterations
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.globl vrsa_sincosf
+	.type	vrsa_sincosf,@function
+vrsa_sincosf:
+
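+# Overview: for arguments below 5e5 the loop below reduces each x by the
+# Cody-Waite scheme spelled out in the inline comments.  Roughly, in C
+# (a sketch; piby2_1, piby2_2, piby2_2tail are the split pi/2 constants
+# defined in the data section above):
+#
+#   int    npi2  = (int)(x * twobypi + 0.5);   /* nearest multiple of pi/2 */
+#   double rhead = x - npi2 * piby2_1;
+#   double rtail = npi2 * piby2_2;
+#   double t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   double r = rhead - rtail;                  /* extra-precision remainder */
+#   /* the low bits of npi2 select the sin/cos quadrant and signs */
+#
+# Larger or non-finite arguments go through __remainder_piby2d2f instead.
+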
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in by Linux C as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *ys
+# rcx - float *yc
+
+	sub		$0x0298,%rsp
+	mov		%r12,save_r12(%rsp)	# save r12
+	mov		%r13,save_r13(%rsp)	# save r13
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+	mov		%rsi,save_xa(%rsp)	# save x_array pointer
+	mov		%rdx,save_ysa(%rsp)	# save ysin_array pointer
+	mov		%rcx,save_yca(%rsp)	# save ycos_array pointer
+#ifdef INTEGER64
+	mov		%rdi,%rax
+#else
+	mov		%edi,%eax
+	mov		%rax,%rdi
+#endif
+	mov		%rdi,save_nv(%rsp)	# save number of values
+# see if too few values to call the main loop
+	shr		$2,%rax	# get number of iterations
+	jz		.L__vrsa_cleanup	# jump if only single calls
+# prepare the iteration counts
+	mov		%rax,p_iter(%rsp)	# save number of iterations
+	shl		$2,%rax
+	sub		%rax,%rdi	# compute number of extra single calls
+	mov		%rdi,save_nv(%rsp)	# save number of left over values
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrsa_top:
+# build the input _m128d
+#	movapd	.L__real_7fffffffffffffff,%xmm2	#
+#	mov		.L__real_7fffffffffffffff,%rdx	#
+
+	mov		save_xa(%rsp),%rsi	# get x_array pointer
+	movlps	(%rsi),%xmm0
+	movhps	8(%rsi),%xmm0
+
+	prefetch	32(%rsi)
+	add		$16,%rsi
+	mov		%rsi,save_xa(%rsp)	# save x_array pointer
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+	movhlps	%xmm0,%xmm8
+	cvtps2pd	%xmm0,%xmm10	# convert input to double.
+	cvtps2pd	%xmm8,%xmm1	# convert input to double.
+
+movdqa	%xmm10,%xmm6
+movdqa	%xmm1,%xmm7
+movapd	.L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd	%xmm2,%xmm10	#Unsign
+andpd	%xmm2,%xmm1	#Unsign
+
+mov	%rdi, p_sin(%rsp)	# save address for sin return
+mov	%rsi, p_cos(%rsp)	# save address for cos return
+
+movd	%xmm10,%rax	#rax is lower arg
+movhpd	%xmm10, p_temp+8(%rsp)	#
+mov	p_temp+8(%rsp),%rcx	#rcx = upper arg
+
+movd	%xmm1,%r8	#r8 is lower arg
+movhpd	%xmm1, p_temp1+8(%rsp)	#
+mov	p_temp1+8(%rsp),%r9	#r9 = upper arg
+
+movdqa	%xmm10,%xmm12
+movdqa	%xmm1,%xmm13
+
+pcmpgtd	%xmm6,%xmm12
+pcmpgtd	%xmm7,%xmm13
+movdqa	%xmm12,%xmm6
+movdqa	%xmm13,%xmm7
+psrldq	$4,%xmm12
+psrldq	$4,%xmm13
+psrldq	$8,%xmm6
+psrldq	$8,%xmm7
+
+mov	$0x3FE921FB54442D18,%rdx	#piby4
+
+mov	$0x411E848000000000,%r10	#5e5
+
+movapd	.L__real_3fe0000000000000(%rip),%xmm4	#0.5 for later use
+
+
+por	%xmm6,%xmm12
+por	%xmm7,%xmm13
+
+movd	%xmm12,%r12	#Move Sign to gpr **
+movd	%xmm13,%r13	#Move Sign to gpr **
+
+movapd	%xmm10,%xmm2	#x0
+movapd	%xmm1,%xmm3	#x1
+movapd	%xmm10,%xmm6	#x0
+movapd	%xmm1,%xmm7	#x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+	cmp		%r10,%rax
+	jae		.Lfirst_or_next3_arg_gt_5e5
+
+	cmp		%r10,%rcx
+	jae		.Lsecond_or_next2_arg_gt_5e5
+
+	cmp		%r10,%r8
+	jae		.Lthird_or_fourth_arg_gt_5e5
+
+	cmp		%r10,%r9
+	jae		.Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+#	npi2 = (int)(x * twobypi + 0.5);
+	movapd	.L__real_3fe45f306dc9c883(%rip),%xmm10
+	mulpd	%xmm10,%xmm2	# * twobypi
+	mulpd	%xmm10,%xmm3	# * twobypi
+
+	addpd	%xmm4,%xmm2	# +0.5, npi2
+	addpd	%xmm4,%xmm3	# +0.5, npi2
+
+	movapd	.L__real_3ff921fb54400000(%rip),%xmm10	# piby2_1
+	movapd	.L__real_3ff921fb54400000(%rip),%xmm1	# piby2_1
+ + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. + +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + +#DELETE +# mov .LQWORD,%rdx PTR __reald_one_zero ;compare value for cossin path +#DELETE + + mov %r8,%r10 + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + +# NEW + + #ADDED + mov %r10,%rdi # npi2 in int + mov %r11,%rsi # npi2 in int + #ADDED + + shr $1,%r10 # 0 and 1 => 0 + shr $1,%r11 # 2 and 3 => 1 + + mov %r10,%rax + mov %r11,%rcx + + #ADDED + xor %r10,%rdi # xor last 2 bits of region for cos + xor %r11,%rsi # xor last 2 bits of region for cos + #ADDED + + not %r12 #~(sign) + not %r13 #~(sign) + and %r12,%r10 #region & ~(sign) + and %r13,%r11 #region & ~(sign) + + not %rax #~(region) + not %rcx #~(region) + not %r12 #~~(sign) + not %r13 #~~(sign) + and %r12,%rax #~region & ~~(sign) + and %r13,%rcx #~region & ~~(sign) + + #ADDED + and .L__reald_one_one(%rip),%rdi # sign for cos + and .L__reald_one_one(%rip),%rsi # sign for cos + #ADDED + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 # sign for sin + and .L__reald_one_one(%rip),%r11 # sign for sin + + + + + + + + mov %r10,%r12 + mov %r11,%r13 + + #ADDED + mov %rdi,%rax + mov %rsi,%rcx + #ADDED + + and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit + + #ADDED + and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit + #ADDED + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + #ADDED + shl $63,%rdi #shift lower sign bit 
left by 63 bits + shl $63,%rsi #shift lower sign bit left by 63 bits + shl $31,%rax #shift upper sign bit left by 31 bits + shl $31,%rcx #shift upper sign bit left by 31 bits + #ADDED + + mov %r10,p_sign_sin(%rsp) #write out lower sign bit + mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit + mov %r11,p_sign1_sin(%rsp) #write out lower sign bit + mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit + + mov %rdi,p_sign_cos(%rsp) #write out lower sign bit + mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit + mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit + mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit + +# NEW + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm10,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm10 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + +# subpd %xmm10,%xmm6 ;rr=rhead-r +# subpd %xmm1,%xmm7 ;rr=rhead-r + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 # move r for r2 + movapd %xmm1,%xmm3 # move r for r2 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + +# subpd xmm6, xmm8 ;rr=(rhead-r) -rtail +# subpd xmm7, xmm9 ;rr=(rhead-r) -rtail + + and .L__reald_zero_one(%rip),%rax # region for jump table + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + +# HARSHA ADDED +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign_sin = Sign, p_sign_cos = Sign, xmm10 = r, xmm2 = r2 +# p_sign1_sin = Sign, p_sign1_cos = Sign, xmm1 = r, xmm3 = r2 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm14 # for x3 + movapd %xmm3,%xmm15 # for x3 + + movapd %xmm2,%xmm0 # for r + movapd %xmm3,%xmm11 # for r + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + movdqa .Lsinarray+0x30(%rip),%xmm6 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm7 # c4 + + movapd .Lsinarray+0x10(%rip),%xmm12 # c2 + movapd .Lsinarray+0x10(%rip),%xmm13 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm10,%xmm14 # x3 + mulpd %xmm1,%xmm15 # x3 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm6 # c2*x2 + mulpd %xmm3,%xmm7 # c2*x2 + + mulpd %xmm2,%xmm12 # c4*x2 + mulpd %xmm3,%xmm13 # c4*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4 + + addpd .Lsinarray(%rip),%xmm12 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm13 # c1+x2c2 + + mulpd %xmm2,%xmm4 # x4(c3+x2c4) + mulpd %xmm3,%xmm5 # x4(c3+x2c4) + + mulpd %xmm2,%xmm6 # x4(c3+x2c4) + mulpd %xmm3,%xmm7 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + 
addpd %xmm12,%xmm6 # zs + addpd %xmm13,%xmm7 # zs + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + mulpd %xmm14,%xmm6 # x3 * zs + mulpd %xmm15,%xmm7 # x3 * zs + + subpd %xmm0,%xmm4 # - (-t) + subpd %xmm11,%xmm5 # - (-t) + + addpd %xmm10,%xmm6 # +x + addpd %xmm1,%xmm7 # +x + +# HARSHA ADDED + + lea .Levensin_oddcos_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm0, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm10 + subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail) + subsd %xmm10,%xmm6 # rr=rhead-r + subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail) + movlpd %xmm10,r+8(%rsp) # store upper r + movlpd %xmm6,rr+8(%rsp) # store upper rr + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_lower_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea 
region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_sincosf_lower_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# added ins- changed input from xmm10 to xmm0 + movd %xmm10,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Restore xmm4 and %xmm3,,%xmm1 xmm7 +# Can use %xmm0,,%xmm8 xmm12 +# %xmm9,,%xmm5 xmm11, xmm13 + + movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to 
doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store upper region + +# movsd %xmm6,%xmm10 +# subsd xmm10,xmm0 ; xmm10 = r=(rhead-rtail) +# subsd %xmm10,%xmm6 ; rr=rhead-r +# subsd xmm6, xmm0 ; xmm6 = rr=((rhead-r) -rtail) + + subsd %xmm0,%xmm6 # xmm10 = r=(rhead-rtail) + +# movlpd QWORD PTR r[rsp], xmm10 ; store upper r +# movlpd QWORD PTR rr[rsp], xmm6 ; store upper rr + + movlpd %xmm6,r(%rsp) # store upper r + + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_upper_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_sincosf_upper_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcheck_next2_args: + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r8 + jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfirst_second_done_fourth_arg_gt_5e5 + + + +# Work on next two args, both < 5e5 +# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5 + + movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. 
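+
+# For reference, a minimal C sketch of the extra-precision reduction that the
+# scalar and packed sequences in this file implement (a Cody-Waite style
+# reduction; twobypi, piby2_1, piby2_2 and piby2_2tail are the constants
+# defined in the data section, and the variable names below are illustrative
+# only):
+#
+#     int    npi2  = (int)(x * twobypi + 0.5);  /* nearest multiple of pi/2 */
+#     double rhead = x - npi2 * piby2_1;        /* head of the remainder    */
+#     double rtail = npi2 * piby2_2;
+#     double t     = rhead;
+#     rhead        = t - rtail;
+#     rtail        = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#     double r     = rhead - rtail;             /* reduced argument         */
+#     int    region = npi2 & 3;                 /* quadrant of the circle   */
+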
+ +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm5,region1(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm1,%xmm7 ; rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + +# subpd %xmm1,%xmm7 ; rr=rhead-r +# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr1[rsp], xmm7 + + jmp .L__vrs4_sincosf_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm0, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm10,%xmm6 ; rhead + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# subpd %xmm10,%xmm6 ; rr=rhead-r +# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr[rsp], xmm6 + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd 
%xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + + + subsd %xmm10,%xmm7 # xmm1 = r=(rhead-rtail) + + movlpd %xmm7,r1+8(%rsp) # store upper r + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sincosf_lower_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_sincosf_reconstruct + + + + + + + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + + movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm1,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp1(%rsp),%r9 #Restore upper arg + + + jmp 0f + +.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call + + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + + jmp .L__vrs4_sincosf_reconstruct + 
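+
+# A sketch of the NaN/Inf fallback taken on the paths above, assuming the
+# usual IEEE-754 double layout (as_uint64/as_double are illustrative helpers;
+# OR-ing in bit 51 turns a NaN or infinity into a quiet NaN, which is then
+# stored as the "reduced" argument with region forced to 0):
+#
+#     uint64_t bits = as_uint64(x);
+#     if ((bits & 0x7ff0000000000000ULL) == 0x7ff0000000000000ULL) {
+#         r      = as_double(bits | 0x0008000000000000ULL);
+#         region = 0;
+#     }
+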
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm10,%xmm6 ; rhead + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# subpd %xmm10,%xmm6 ; rr=rhead-r +# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr[rsp], xmm6 + + +# Work on next two args, third arg < 5e5, fourth arg >= 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_fourth_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r8d,region1(%rsp) # store lower region + +# movsd %xmm7,%xmm1 +# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail) +# subsd %xmm1,%xmm7 ; rr=rhead-r +# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail) + + subsd %xmm10,%xmm7 # xmm10 = r=(rhead-rtail) + +# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r +# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr + + movlpd %xmm7,r1(%rsp) # store upper r + +#Work 
on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_sincosf_upper_naninf_higher + + lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sincosf_upper_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_sincosf_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_sincosf_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign_sin = Sign, ; p_sign_cos = Sign, xmm10 = r, xmm2 = r2 +# p_sign1_sin = Sign, ; p_sign1_cos = Sign, xmm1 = r, xmm3 = r2 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd r(%rsp),%xmm10 + movapd r1(%rsp),%xmm1 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + + mov %r8,%r10 + mov %r9,%r11 + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + +# NEW + + #ADDED + mov %r10,%rdi + mov %r11,%rsi + #ADDED + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + #ADDED + xor %r10,%rdi + xor %r11,%rsi + #ADDED + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + #ADDED + and .L__reald_one_one(%rip),%rdi #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%rsi #(~AB+A~B)&1 + #ADDED + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + + + + + + + mov %r10,%r12 + mov %r11,%r13 + + #ADDED + mov %rdi,%rax + mov %rsi,%rcx + #ADDED + + and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit + + #ADDED + and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit + and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit + #ADDED + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + #ADDED + shl $63,%rdi #shift lower sign bit left by 63 bits + shl $63,%rsi #shift lower sign bit left by 63 bits + shl $31,%rax #shift upper sign bit left by 31 bits + shl $31,%rcx #shift upper sign bit left by 31 bits + #ADDED + + mov %r10,p_sign_sin(%rsp) #write out lower sign bit + mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit + mov %r11,p_sign1_sin(%rsp) #write out lower sign bit + mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit + + mov %rdi,p_sign_cos(%rsp) #write out lower sign bit + mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit + mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit + mov 
%rcx,p_sign1_cos+8(%rsp) #write out upper sign bit +#NEW + + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + +# HARSHA ADDED +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign_cos = Sign, p_sign_sin = Sign, xmm10 = r, xmm2 = r2 +# p_sign1_cos = Sign, p_sign1_sin = Sign, xmm1 = r, xmm3 = r2 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm14 # for x3 + movapd %xmm3,%xmm15 # for x3 + + movapd %xmm2,%xmm0 # for r + movapd %xmm3,%xmm11 # for r + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + movdqa .Lsinarray+0x30(%rip),%xmm6 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm7 # c4 + + movapd .Lsinarray+0x10(%rip),%xmm12 # c2 + movapd .Lsinarray+0x10(%rip),%xmm13 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm10,%xmm14 # x3 + mulpd %xmm1,%xmm15 # x3 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm6 # c2*x2 + mulpd %xmm3,%xmm7 # c2*x2 + + mulpd %xmm2,%xmm12 # c4*x2 + mulpd %xmm3,%xmm13 # c4*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4 + + addpd .Lsinarray(%rip),%xmm12 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm13 # c1+x2c2 + + mulpd %xmm2,%xmm4 # x4(c3+x2c4) + mulpd %xmm3,%xmm5 # x4(c3+x2c4) + + mulpd %xmm2,%xmm6 # x4(c3+x2c4) + mulpd %xmm3,%xmm7 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + addpd %xmm12,%xmm6 # zs + addpd %xmm13,%xmm7 # zs + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + mulpd %xmm14,%xmm6 # x3 * zs + mulpd %xmm15,%xmm7 # x3 * zs + + subpd %xmm0,%xmm4 # - (-t) + subpd %xmm11,%xmm5 # - (-t) + + addpd %xmm10,%xmm6 # +x + addpd %xmm1,%xmm7 # +x + +# HARSHA ADDED + + + lea .Levensin_oddcos_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + + + + + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrsa_sincosf_cleanup: + + movapd p_sign_cos(%rsp),%xmm10 + movapd p_sign1_cos(%rsp),%xmm1 + xorpd %xmm4,%xmm10 # Cos term (+) Sign + xorpd %xmm5,%xmm1 # Cos term (+) Sign + + cvtpd2ps %xmm10,%xmm0 + cvtpd2ps %xmm1,%xmm11 + + movapd p_sign_sin(%rsp),%xmm14 + movapd p_sign1_sin(%rsp),%xmm15 + xorpd %xmm6,%xmm14 # Sin term (+) Sign + xorpd %xmm7,%xmm15 # Sin term (+) Sign + + cvtpd2ps %xmm14,%xmm12 + cvtpd2ps %xmm15,%xmm13 + + +.L__vrsa_bottom1: +# store the result _m128d + + mov save_ysa(%rsp),%r8 + mov save_yca(%rsp),%r9 + + movlps %xmm0, (%r9) # save the cos + movlps %xmm12, (%r8) # save the sin + movlps %xmm11, 8(%r9) # save the cos + movlps %xmm13, 8(%r8) # save the sin + + + prefetch 32(%r8) + prefetch 32(%r9) 
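+
+# In C terms, the vector core evaluated above computes, per double lane
+# (a sketch; s1..s4 and c1..c4 are the .Lsinarray/.Lcosarray minimax
+# coefficients and r is the reduced argument):
+#
+#     double x2 = r * r, x3 = x2 * r, x4 = x2 * x2;
+#     double zs = (s1 + x2 * s2) + x4 * (s3 + x2 * s4);
+#     double zc = (c1 + x2 * c2) + x4 * (c3 + x2 * c4);
+#     double s  = r + x3 * zs;                  /* ~ sin(r) */
+#     double c  = (1.0 - 0.5 * x2) + x4 * zc;   /* ~ cos(r) */
+#
+# with the p_sign_sin/p_sign_cos masks built earlier applied by the xorpd
+# instructions in the cleanup block above.
+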
+ + add $16,%r8 + add $16,%r9 + + mov %r8,save_ysa(%rsp) # save y_sinarray pointer + mov %r9,save_yca(%rsp) # save y_cosarray pointer + + mov p_iter(%rsp),%rax # get number of iterations + sub $1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vrsa_top + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vrsa_cleanup + +.L__final_check: + + mov save_r12(%rsp),%r12 # restore r12 + mov save_r13(%rsp),%r13 # restore r13 + + add $0x0298,%rsp + ret + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# we jump here when we have an odd number of cos calls to make at the end +# we assume that rdx is pointing at the next x array element, r8 at the next y array element. +# The number of values left is in save_nv + +.align 16 +.L__vrsa_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__final_check # exit if not + + mov save_xa(%rsp),%rsi + mov save_ysa(%rsp),%rdi + mov save_yca(%rsp),%r12 + + +# fill in a m128d with zeroes and the extra values and then make a recursive call. + xorps %xmm0,%xmm0 + movss %xmm0,p_temp+4(%rsp) + movlps %xmm0,p_temp+8(%rsp) + + + mov (%rsi),%ecx # we know there's at least one + mov %ecx,p_temp(%rsp) + cmp $2,%rax + jl .L__vrsacg + + mov 4(%rsi),%ecx # do the second value + mov %ecx,p_temp+4(%rsp) + cmp $3,%rax + jl .L__vrsacg + + mov 8(%rsi),%ecx # do the third value + mov %ecx,p_temp+8(%rsp) + +.L__vrsacg: + mov $4,%rdi # parameter for N + lea p_temp(%rsp),%rsi # &x parameter + lea p_temp2(%rsp),%rdx # &ys parameter + lea p_temp3(%rsp),%rcx # &yc parameter + call vrsa_sincosf@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ysa(%rsp),%rdi + mov save_yca(%rsp),%r12 + mov save_nv(%rsp),%rax # get number of values + + mov p_temp2(%rsp),%ecx + mov %ecx,(%rdi) # we know there's at least one + mov p_temp3(%rsp),%edx + mov %edx,(%r12) # we know there's at least one + cmp $2,%rax + jl .L__vrsacgf + + mov p_temp2+4(%rsp),%ecx + mov %ecx,4(%rdi) # do the second value + mov p_temp3+4(%rsp),%edx + mov %edx,4(%r12) # do the second value + cmp $3,%rax + jl .L__vrsacgf + + mov p_temp2+8(%rsp),%ecx + mov %ecx,8(%rdi) # do the third value + mov p_temp3+8(%rsp),%edx + mov %edx,8(%r12) # do the third value + +.L__vrsacgf: + jmp .L__final_check + + + + + + + + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +.align 16 +.Lcoscos_coscos_piby4: +# Cos in 
%xmm5,%xmm4 +# Sin in %xmm7,%xmm6 +# Lower and Upper Even + + movapd %xmm4,%xmm8 + movapd %xmm5,%xmm9 + + movapd %xmm6,%xmm4 + movapd %xmm7,%xmm5 + + movapd %xmm8,%xmm6 + movapd %xmm9,%xmm7 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lcossin_cossin_piby4: + + movhlps %xmm5,%xmm9 + movhlps %xmm7,%xmm13 + + movlhps %xmm9,%xmm7 + movlhps %xmm13,%xmm5 + + movhlps %xmm4,%xmm8 + movhlps %xmm6,%xmm12 + + movlhps %xmm8,%xmm6 + movlhps %xmm12,%xmm4 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lsincos_cossin_piby4: + movsd %xmm5,%xmm9 + movsd %xmm7,%xmm13 + + movsd %xmm9,%xmm7 + movsd %xmm13,%xmm5 + + movhlps %xmm4,%xmm8 + movhlps %xmm6,%xmm12 + + movlhps %xmm8,%xmm6 + movlhps %xmm12,%xmm4 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lsincos_sincos_piby4: + movsd %xmm5,%xmm9 + movsd %xmm7,%xmm13 + + movsd %xmm9,%xmm7 + movsd %xmm13,%xmm5 + + movsd %xmm4,%xmm8 + movsd %xmm6,%xmm12 + + movsd %xmm8,%xmm6 + movsd %xmm12,%xmm4 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lcossin_sincos_piby4: + movhlps %xmm5,%xmm9 + movhlps %xmm7,%xmm13 + + movlhps %xmm9,%xmm7 + movlhps %xmm13,%xmm5 + + movsd %xmm4,%xmm8 + movsd %xmm6,%xmm12 + + movsd %xmm8,%xmm6 + movsd %xmm12,%xmm4 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lcoscos_sinsin_piby4: +# Cos in %xmm5,%xmm4 +# Sin in %xmm7,%xmm6 +# Lower even, Upper odd, Swap upper + + movapd %xmm5,%xmm9 + movapd %xmm7,%xmm5 + movapd %xmm9,%xmm7 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lsinsin_coscos_piby4: +# Cos in %xmm5,%xmm4 +# Sin in %xmm7,%xmm6 +# Lower odd, Upper even, Swap lower + + movapd %xmm4,%xmm8 + movapd %xmm6,%xmm4 + movapd %xmm8,%xmm6 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lcoscos_cossin_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + + movapd %xmm5,%xmm9 + movapd %xmm7,%xmm5 + movapd %xmm9,%xmm7 + + movhlps %xmm4,%xmm8 + movhlps %xmm6,%xmm12 + + movlhps %xmm8,%xmm6 + movlhps %xmm12,%xmm4 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lcoscos_sincos_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + + movapd %xmm5,%xmm9 + movapd %xmm7,%xmm5 + movapd %xmm9,%xmm7 + + movsd %xmm4,%xmm8 + movsd %xmm6,%xmm12 + + movsd %xmm8,%xmm6 + movsd %xmm12,%xmm4 + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lcossin_coscos_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + + movapd %xmm4,%xmm8 + movapd %xmm6,%xmm4 + movapd %xmm8,%xmm6 + + movhlps %xmm5,%xmm9 + movhlps %xmm7,%xmm13 + + movlhps %xmm9,%xmm7 + movlhps %xmm13,%xmm5 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lcossin_sinsin_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + movhlps %xmm5,%xmm9 + movhlps %xmm7,%xmm13 + + movlhps %xmm9,%xmm7 + movlhps %xmm13,%xmm5 + + jmp .L__vrsa_sincosf_cleanup + + +.align 16 +.Lsincos_coscos_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + movapd %xmm4,%xmm8 + movapd %xmm6,%xmm4 + movapd %xmm8,%xmm6 + + movsd %xmm5,%xmm9 + movsd %xmm7,%xmm13 + + movsd %xmm9,%xmm7 + movsd %xmm13,%xmm5 + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lsincos_sinsin_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + movsd %xmm5,%xmm9 + movsd %xmm7,%xmm5 + movsd %xmm9,%xmm7 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lsinsin_cossin_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + movhlps %xmm4,%xmm8 + movhlps %xmm6,%xmm12 + + movlhps %xmm8,%xmm6 + movlhps %xmm12,%xmm4 + + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lsinsin_sincos_piby4: +# Cos in xmm4 and xmm5 +# Sin in xmm6 and xmm7 + movsd %xmm4,%xmm8 + movsd %xmm6,%xmm4 + movsd %xmm8,%xmm6 + jmp .L__vrsa_sincosf_cleanup + +.align 16 +.Lsinsin_sinsin_piby4: +# Cos 
in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+# Lower and Upper even; everything is already in place, no swap needed
+
+	jmp 	.L__vrsa_sincosf_cleanup
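+
+# A C sketch of the dispatch that selects between the sixteen .L*_piby4
+# blocks above: one region-parity bit per element decides whether that lane's
+# sin/cos results must be exchanged, and the four bits form the index into
+# .Levensin_oddcos_tbl (computed-goto syntax used for illustration only):
+#
+#     int idx = (region0 & 1)
+#             | (region1 & 1) << 1
+#             | (region2 & 1) << 2
+#             | (region3 & 1) << 3;
+#     goto *evensin_oddcos_tbl[idx];
+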
diff --git a/src/gas/vrsasinf.S b/src/gas/vrsasinf.S new file mode 100644 index 0000000..6cbff59 --- /dev/null +++ b/src/gas/vrsasinf.S
@@ -0,0 +1,2441 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsasinf.s
+#
+# A vector implementation of the sinf libm function.
+#
+# Prototype:
+#
+#     vrsa_sinf(int n, float* x, float* y);
+#
+# Computes Sine of x for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This inlines a routine that computes 4 single precision Sine values at a time.
+# The four values are passed as packed singles in xmm10.
+# The four results are returned as packed singles in xmm10.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows passing and returning packed values
+# in a register this way.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
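+#
+# A hypothetical usage sketch of the array interface described above
+# (no error handling, matching the routine itself):
+#
+#     float x[7] = {0.0f, 0.5f, 1.0f, 2.0f, 4.0f, 8.0f, 16.0f};
+#     float y[7];
+#     vrsa_sinf(7, x, y);  /* y[i] = sinf(x[i]); one 4-wide block plus
+#                             three leftovers done in the cleanup path */
+#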
+# Author: Harsha Jagasia +# Email: harsha.jagasia@amd.com + +#ifdef __ELF__ +.section .note.GNU-stack,"",@progbits +#endif + +.data +.align 64 +.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero + .quad 0x07fffffffffffffff +.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0 + .quad 0x03ff0000000000000 +.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27 + .quad 0x03e40000000000000 +.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5 + .quad 0x03fe0000000000000 +.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666 + .quad 0x03fc5555555555555 +.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi + .quad 0x03fe45f306dc9c883 +.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1 + .quad 0x03ff921fb54400000 +.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail + .quad 0x03dd0b4611a626331 +.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2 + .quad 0x03dd0b4611a600000 +.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail + .quad 0x03ba3198a2e037073 +.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail + .quad 0x0fffffffff8000000 +.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit + .quad 0x08000000000000000 +.L__reald_one_one: .quad 0x00000000100000001 # + .quad 0 +.L__reald_two_two: .quad 0x00000000200000002 # + .quad 0 +.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter + .quad 0 +.L__reald_zero_one: .quad 0x00000000000000001 # + .quad 0 +.L__reald_two_zero: .quad 0x00000000200000000 # + .quad 0 +.L__realq_one_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # +.L__realq_two_two: .quad 0x00000000000000002 # + .quad 0x00000000000000002 # +.L__real_1_x_mask: .quad 0x0ffffffffffffffff # + .quad 0x03ff0000000000000 # +.L__real_zero: .quad 0x00000000000000000 # + .quad 0x00000000000000000 # +.L__real_one: .quad 0x00000000000000001 # + .quad 0x00000000000000001 # + +.Lcosarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03FA5555555502F31 + .quad 0x0BF56C16BF55699D7 # -0.00138889 c2 + .quad 0x0BF56C16BF55699D7 + .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3 + .quad 0x03EFA015C50A93B49 + .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4 + .quad 0x0BE92524743CC46B8 + +.Lsinarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BFC555555545E87D + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x03F811110DF01232D + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x0BF2A013A88A37196 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x03EC6DBE4AD1572D5 + +.Lsincosarray: + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x03F811110DF01232D # 0.00833333 s2 + .quad 0x0BF56C16BF55699D7 + .quad 0x0BF2A013A88A37196 # -0.000198413 s3 + .quad 0x03EFA015C50A93B49 + .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4 + .quad 0x0BE92524743CC46B8 + +.Lcossinarray: + .quad 0x03FA5555555502F31 # 0.0416667 c1 + .quad 0x0BFC555555545E87D # -0.166667 s1 + .quad 0x0BF56C16BF55699D7 # c2 + .quad 0x03F811110DF01232D + .quad 0x03EFA015C50A93B49 # c3 + .quad 0x0BF2A013A88A37196 + .quad 0x0BE92524743CC46B8 # c4 + .quad 0x03EC6DBE4AD1572D5 + +.align 8 + .Levensin_oddcos_tbl: + + .quad .Lsinsin_sinsin_piby4 # 0 * ; Done + .quad .Lsinsin_sincos_piby4 # 1 + ; Done + .quad .Lsinsin_cossin_piby4 # 2 ; Done + .quad .Lsinsin_coscos_piby4 # 3 + ; Done + + .quad .Lsincos_sinsin_piby4 # 4 ; Done + .quad .Lsincos_sincos_piby4 # 5 * ; Done + .quad .Lsincos_cossin_piby4 # 6 ; Done + .quad 
.Lsincos_coscos_piby4		# 7		; Done
+
+	.quad	.Lcossin_sinsin_piby4		# 8		; Done
+	.quad	.Lcossin_sincos_piby4		# 9		; TBD
+	.quad	.Lcossin_cossin_piby4		# 10 *		; Done
+	.quad	.Lcossin_coscos_piby4		# 11		; Done
+
+	.quad	.Lcoscos_sinsin_piby4		# 12		; Done
+	.quad	.Lcoscos_sincos_piby4		# 13 +		; Done
+	.quad	.Lcoscos_cossin_piby4		# 14		; Done
+	.quad	.Lcoscos_coscos_piby4		# 15 *		; Done
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+	.weak	vrsa_sinf_
+	.set	vrsa_sinf_,__vrsa_sinf__
+	.weak	vrsa_sinf__
+	.set	vrsa_sinf__,__vrsa_sinf__
+
+	.text
+	.align 16
+	.p2align 4,,15
+
+#FORTRAN subroutine implementation of array sinf
+#VRSA_SINF(N,X,Y)
+#C equivalent:
+#void vrsa_sinf__(int *n, float *x, float *y)
+#{
+#	vrsa_sinf(*n,x,y);
+#}
+
+.globl __vrsa_sinf__
+	.type	__vrsa_sinf__,@function
+__vrsa_sinf__:
+	mov	(%rdi),%edi
+
+	.align 16
+	.p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ	p_temp,0		# temporary for get/put bits operation
+.equ	p_temp1,0x10		# temporary for get/put bits operation
+
+.equ	save_xmm6,0x20		# temporary for get/put bits operation
+.equ	save_xmm7,0x30		# temporary for get/put bits operation
+.equ	save_xmm8,0x40		# temporary for get/put bits operation
+.equ	save_xmm9,0x50		# temporary for get/put bits operation
+.equ	save_xmm0,0x60		# temporary for get/put bits operation
+.equ	save_xmm11,0x70		# temporary for get/put bits operation
+.equ	save_xmm12,0x80		# temporary for get/put bits operation
+.equ	save_xmm13,0x90		# temporary for get/put bits operation
+.equ	save_xmm14,0x0A0	# temporary for get/put bits operation
+.equ	save_xmm15,0x0B0	# temporary for get/put bits operation
+
+.equ	r,0x0C0			# pointer to r for remainder_piby2
+.equ	rr,0x0D0		# pointer to rr for remainder_piby2
+.equ	region,0x0E0		# pointer to region for remainder_piby2
+
+.equ	r1,0x0F0		# pointer to r for remainder_piby2
+.equ	rr1,0x0100		# pointer to rr for remainder_piby2
+.equ	region1,0x0110		# pointer to region for remainder_piby2
+
+.equ	p_temp2,0x0120		# temporary for get/put bits operation
+.equ	p_temp3,0x0130		# temporary for get/put bits operation
+
+.equ	p_temp4,0x0140		# temporary for get/put bits operation
+.equ	p_temp5,0x0150		# temporary for get/put bits operation
+
+.equ	p_original,0x0160	# original x
+.equ	p_mask,0x0170		# original x
+.equ	p_sign,0x0180		# original x
+
+.equ	p_original1,0x0190	# original x
+.equ	p_mask1,0x01A0		# original x
+.equ	p_sign1,0x01B0		# original x
+
+.equ	save_r12,0x01C0		# temporary for get/put bits operation
+.equ	save_r13,0x01D0		# temporary for get/put bits operation
+
+.equ	save_xa,0x01E0		#qword
+.equ	save_ya,0x01F0		#qword
+
+.equ	save_nv,0x0200		#qword
+.equ	p_iter,0x0210		# qword storage for number of loop iterations
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.globl vrsa_sinf
+	.type	vrsa_sinf,@function
+vrsa_sinf:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in by Linux (System V AMD64 ABI) as:
+# rdi - int    n
+# rsi - float *x
+# rdx - float *y
+
+	sub	$0x0228,%rsp
+	mov	%r12,save_r12(%rsp)	# save r12
+	mov	%r13,save_r13(%rsp)	# save r13
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+	mov	%rsi,save_xa(%rsp)	# save x_array pointer
+	mov	%rdx,save_ya(%rsp)	# save y_array pointer
+#ifdef INTEGER64
+	mov	%rdi,%rax
+#else
+	mov	%edi,%eax
+	mov	%rax,%rdi
+#endif
+	mov	%rdi,save_nv(%rsp)	# 
save number of values +# see if too few values to call the main loop + shr $2,%rax # get number of iterations + jz .L__vrsa_cleanup # jump if only single calls +# prepare the iteration counts + mov %rax,p_iter(%rsp) # save number of iterations + shl $2,%rax + sub %rax,%rdi # compute number of extra single calls + mov %rdi,save_nv(%rsp) # save number of left over values + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#START LOOP +.align 16 +.L__vrsa_top: +# build the input _m128d + mov save_xa(%rsp),%rsi # get x_array pointer + movlps (%rsi),%xmm0 + movhps 8(%rsi),%xmm0 + + prefetch 32(%rsi) + add $16,%rsi + mov %rsi,save_xa(%rsp) # save x_array pointer + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# V4 START + movhlps %xmm0,%xmm8 + cvtps2pd %xmm0,%xmm10 # convert input to double. + cvtps2pd %xmm8,%xmm1 # convert input to double. + +movdqa %xmm10,%xmm6 +movdqa %xmm1,%xmm7 +movapd .L__real_7fffffffffffffff(%rip),%xmm2 + +andpd %xmm2,%xmm10 #Unsign +andpd %xmm2,%xmm1 #Unsign + +movd %xmm10,%rax #rax is lower arg +movhpd %xmm10, p_temp+8(%rsp) # +mov p_temp+8(%rsp),%rcx #rcx = upper arg + +movd %xmm1,%r8 #r8 is lower arg +movhpd %xmm1, p_temp1+8(%rsp) # +mov p_temp1+8(%rsp),%r9 #r9 = upper arg + +movdqa %xmm10,%xmm12 +movdqa %xmm1,%xmm13 + +pcmpgtd %xmm6,%xmm12 +pcmpgtd %xmm7,%xmm13 +movdqa %xmm12,%xmm6 +movdqa %xmm13,%xmm7 +psrldq $4,%xmm12 +psrldq $4,%xmm13 +psrldq $8,%xmm6 +psrldq $8,%xmm7 + +mov $0x3FE921FB54442D18,%rdx #piby4 + +mov $0x411E848000000000,%r10 #5e5 + +movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use + + +por %xmm6,%xmm12 +por %xmm7,%xmm13 +movd %xmm12,%r12 #Move Sign to gpr ** +movd %xmm13,%r13 #Move Sign to gpr ** + +movapd %xmm10,%xmm2 #x0 +movapd %xmm1,%xmm3 #x1 +movapd %xmm10,%xmm6 #x0 +movapd %xmm1,%xmm7 #x1 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm2 = x, xmm4 =0.5/t, xmm6 =x +# xmm3 = x, xmm5 =0.5/t, xmm7 =x +.align 16 +.Leither_or_both_arg_gt_than_piby4: + cmp %r10,%rax + jae .Lfirst_or_next3_arg_gt_5e5 + + cmp %r10,%rcx + jae .Lsecond_or_next2_arg_gt_5e5 + + cmp %r10,%r8 + jae .Lthird_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfourth_arg_gt_5e5 + + +# /* Find out what multiple of piby2 */ +# npi2 = (int)(x * twobypi + 0.5); + movapd .L__real_3fe45f306dc9c883(%rip),%xmm10 + mulpd %xmm10,%xmm2 # * twobypi + mulpd %xmm10,%xmm3 # * twobypi + + addpd %xmm4,%xmm2 # +0.5, npi2 + addpd %xmm4,%xmm3 # +0.5, npi2 + + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + + cvtdq2pd %xmm4,%xmm2 # and back to double. + cvtdq2pd %xmm5,%xmm3 # and back to double. 
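+
+# The four cmp/jae checks above implement the following per-element range
+# split (a C sketch, assuming the cutoff 0x411E848000000000, i.e. 500000.0,
+# chosen for this file):
+#
+#     if (fabs(x) < 500000.0) {
+#         /* fast path: inline two-word reduction, continued below */
+#     } else {
+#         /* slow path: __remainder_piby2d2f produces r and region,
+#            with NaN/Inf inputs screened out before the call */
+#     }
+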
+ +# /* Subtract the multiple from x to get an extra-precision remainder */ + + movd %xmm4,%r8 # Region + movd %xmm5,%r9 # Region + + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + mov %r8,%r10 + mov %r9,%r11 + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm1,%xmm7 # t-rhead + + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail +# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + +# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead +# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail + movapd %xmm10,%xmm6 # rhead + movapd %xmm1,%xmm7 # rhead + + subpd %xmm8,%xmm10 # r = rhead - rtail + subpd %xmm9,%xmm1 # r = rhead - rtail + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail +# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 # move r for r2 + movapd %xmm1,%xmm3 # move r for r2 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + lea .Levensin_oddcos_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + + + + + 
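+
+# The sign bookkeeping earlier in this block (the shr/not/and/or chain)
+# reduces to the following per-element C sketch, matching the "~AB+A~B"
+# comments: A is the sign of the input, B is bit 1 of the region, and the
+# chain evaluates A XOR B lane by lane:
+#
+#     int A = (x < 0.0f);            /* input sign, from the pcmpgtd above */
+#     int B = (region >> 1) & 1;     /* which half of the circle           */
+#     int sign = (~A & B) | (A & ~B);    /* == A ^ B, sign of the result   */
+#     /* sign is shifted to the fp sign-bit position and applied later
+#        with xorpd via p_sign/p_sign1 */
+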
+ + + + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfirst_or_next3_arg_gt_5e5: +# %rcx,,%rax r8, r9 + + cmp %r10,%rcx #is upper arg >= 5e5 + jae .Lboth_arg_gt_5e5 + +.Llower_arg_gt_5e5: +# Upper Arg is < 5e5, Lower arg is >= 5e5 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Be sure not to use %xmm3,%xmm1 and xmm7 +# Use %xmm8,,%xmm5 xmm0, xmm12 +# %xmm11,,%xmm9 xmm13 + + + movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg + movhlps %xmm2,%xmm2 + movhlps %xmm6,%xmm6 + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1 + cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2 + cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %ecx,region+4(%rsp) # store upper region + movsd %xmm6,%xmm10 + subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail) + movlpd %xmm10,r+8(%rsp) # store upper r + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_lower_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_sinf_lower_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + 
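+
+# A C sketch of the mixed case handled above, where only the lower lane is
+# huge or non-finite (helper names are illustrative): the small upper lane is
+# reduced with scalar SSE only, so the suspect lane can never raise a
+# spurious exception, and live registers are spilled to the p_temp* slots
+# around the external call, which may clobber volatiles:
+#
+#     r[1] = inline_reduce(x1, &region[1]);   /* scalar ops only       */
+#     if (is_nan_or_inf(x0)) {
+#         r[0] = quiet(x0);  region[0] = 0;
+#     } else {
+#         spill_live_regs();
+#         __remainder_piby2d2f(...);          /* fills r[0], region[0] */
+#         restore_live_regs();
+#     }
+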
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lboth_arg_gt_5e5: +#Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %rax,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5 + + mov %rcx,p_temp(%rsp) #Save upper arg + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r(%rsp),%rsi + +# added ins- changed input from xmm10 to xmm0 + movd %xmm10,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + mov p_temp(%rsp),%rcx #Restore upper arg + jmp 0f + +.L__vrs4_sinf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf +# mov .LQWORD,%rax PTR p_original[rsp] + mov $0x00008000000000000,%r11 + or %r11,%rax + mov %rax,r(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5 + + + mov %r8,p_temp2(%rsp) + mov %r9,p_temp4(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm6,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp2(%rsp),%r8 + mov p_temp4(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + + jmp 0f + +.L__vrs4_sinf_upper_naninf_of_both_gt_5e5: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) #region = 0 + +.align 16 +0: + jmp .Lcheck_next2_args + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lsecond_or_next2_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Restore xmm4 and %xmm3,,%xmm1 xmm7 +# Can use %xmm0,,%xmm8 xmm12 +# %xmm9,,%xmm5 xmm11, xmm13 + + movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi + addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1 + cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm1 = piby2_2 + cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm2,%xmm8 # npi2 * piby2_1 + subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail + +#t = rhead; + movsd %xmm6,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm2,%xmm12 # npi2 * piby2_2tail + subsd %xmm6,%xmm5 # 
t-rhead + subsd %xmm5,%xmm0 # (rtail-(t-rhead)) + addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %eax,region(%rsp) # store upper region + + subsd %xmm0,%xmm6 # xmm10 = r=(rhead-rtail) + + movlpd %xmm6,r(%rsp) # store upper r + + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov $0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %rcx,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_upper_naninf + + mov %r8,p_temp(%rsp) + mov %r9,p_temp2(%rsp) + movapd %xmm1,p_temp1(%rsp) + movapd %xmm3,p_temp3(%rsp) + movapd %xmm7,p_temp5(%rsp) + + lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r+8(%rsp),%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp(%rsp),%r8 + mov p_temp2(%rsp),%r9 + movapd p_temp1(%rsp),%xmm1 + movapd p_temp3(%rsp),%xmm3 + movapd p_temp5(%rsp),%xmm7 + jmp 0f + +.L__vrs4_sinf_upper_naninf: + mov $0x00008000000000000,%r11 + or %r11,%rcx + mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region+4(%rsp) # region =0 + +.align 16 +0: + jmp .Lcheck_next2_args + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcheck_next2_args: + + mov $0x411E848000000000,%r10 #5e5 + + + cmp %r10,%r8 + jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5 + + cmp %r10,%r9 + jae .Lfirst_second_done_fourth_arg_gt_5e5 + + + +# Work on next two args, both < 5e5 +# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5 + + movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi + addpd %xmm4,%xmm3 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1 + cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2 + cvtdq2pd %xmm5,%xmm3 # and back to double. 
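+
+# For reference, the identity that the jump through .Levensin_oddcos_tbl
+# implements for this file (a C sketch; sin_poly/cos_poly stand for the
+# polynomial cores and sign for the A^B bit built at reconstruct time):
+#
+#     /* region = npi2 mod 4, r = reduced |x| */
+#     double v = (region & 1) ? cos_poly(r) : sin_poly(r);
+#     float  result = (float)(sign ? -v : v);   /* == sinf(x) */
+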
+ +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm5,region1(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm3,%xmm1 # npi2 * piby2_1; + +# rtail = npi2 * piby2_2; + mulpd %xmm3,%xmm9 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm7,%xmm1 # t + +# rhead = t - rtail; + subpd %xmm9,%xmm1 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail + + subpd %xmm1,%xmm7 # t-rhead + subpd %xmm7,%xmm9 # - ((t - rhead) - rtail) + addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm1,%xmm7 ; rhead + subpd %xmm9,%xmm1 # r = rhead - rtail + movapd %xmm1,r1(%rsp) + +# subpd %xmm1,%xmm7 ; rr=rhead-r +# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr1[rsp], xmm7 + + jmp .L__vrs4_sinf_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lthird_or_fourth_arg_gt_5e5: +#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 +# Do not use %xmm3,,%xmm1 xmm7 +# Can use %xmm11,,%xmm9 xmm13 +# %xmm8,,%xmm5 xmm0, xmm12 +# Restore xmm4 + +# Work on first two args, both < 5e5 + + + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm10,%xmm6 ; rhead + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# subpd %xmm10,%xmm6 ; rr=rhead-r +# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr[rsp], xmm6 + + +# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5 + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_third_or_fourth_arg_gt_5e5: +# %rcx,,%rax r8, r9 +# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + + + mov $0x411E848000000000,%r10 #5e5 + + cmp %r10,%r9 + jae .Lboth_arg_gt_5e5_higher + + +# Upper Arg is <5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call + movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg + movhlps %xmm3,%xmm3 + movhlps %xmm7,%xmm7 + + +# Work on Upper arg +# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd 
%xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r9d,region1+4(%rsp) # store upper region + + subsd %xmm10,%xmm7 # xmm1 = r=(rhead-rtail) + + movlpd %xmm7,r1+8(%rsp) # store upper r + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + +# Work on Lower arg + mov $0x07ff0000000000000,%r11 # is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_lower_naninf_higher + + lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1(%rsp),%rdi + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sinf_lower_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_sinf_reconstruct + + + + + + + +.align 16 +.Lboth_arg_gt_5e5_higher: +# Upper Arg is >= 5e5, Lower arg is >= 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + + movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call + + mov $0x07ff0000000000000,%r11 #is lower arg nan/inf + mov %r11,%r10 + and %r8,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher + + mov %r9,p_temp1(%rsp) #Save upper arg + lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf + lea r1(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm1,%rdi + + call __remainder_piby2d2f@PLT + + mov p_temp1(%rsp),%r9 #Restore upper arg + + jmp 0f + +.L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf + mov $0x00008000000000000,%r11 + or %r11,%r8 + mov %r8,r1(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1(%rsp) #region = 0 + +.align 16 +0: + mov $0x07ff0000000000000,%r11 #is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher + + lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + movd %xmm7,%rdi + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) #region = 0 + +.align 16 +0: + jmp .L__vrs4_sinf_reconstruct + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lfourth_arg_gt_5e5: +#first two args are < 5e5, third arg < 
5e5, fourth arg >= 5e5 +#%rcx,,%rax r8, r9 +#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5 + +# Work on first two args, both < 5e5 + + mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi + addpd %xmm4,%xmm2 # +0.5, npi2 + movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1 + cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers + movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2 + cvtdq2pd %xmm4,%xmm2 # and back to double. + +### +# /* Subtract the multiple from x to get an extra-precision remainder */ + movlpd %xmm4,region(%rsp) # Region +### + +# rhead = x - npi2 * piby2_1; + mulpd %xmm2,%xmm10 # npi2 * piby2_1; +# rtail = npi2 * piby2_2; + mulpd %xmm2,%xmm8 # rtail + +# rhead = x - npi2 * piby2_1; + subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1; + +# t = rhead; + movapd %xmm6,%xmm10 # t + +# rhead = t - rtail; + subpd %xmm8,%xmm10 # rhead + +# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail + + subpd %xmm10,%xmm6 # t-rhead + subpd %xmm6,%xmm8 # - ((t - rhead) - rtail) + addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + +# movapd %xmm10,%xmm6 ; rhead + subpd %xmm8,%xmm10 # r = rhead - rtail + movapd %xmm10,r(%rsp) + +# subpd %xmm10,%xmm6 ; rr=rhead-r +# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail +# movapd OWORD PTR rr[rsp], xmm6 + + +# Work on next two args, third arg < 5e5, fourth arg >= 5e5 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lfirst_second_done_fourth_arg_gt_5e5: + +# Upper Arg is >= 5e5, Lower arg is < 5e5 +# %r9,%r8 +# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5 + + movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call + +# Work on Lower arg +# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg + movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5 + mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi + addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5) + movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1 + cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints + movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2 + cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles + +#/* Subtract the multiple from x to get an extra-precision remainder */ +#rhead = x - npi2 * piby2_1; + mulsd %xmm3,%xmm2 # npi2 * piby2_1 + subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1) + movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail + +#t = rhead; + movsd %xmm7,%xmm5 # xmm5 = t = rhead + +#rtail = npi2 * piby2_2; + mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2) + +#rhead = t - rtail + subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail) + +#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + mulsd %xmm3,%xmm6 # npi2 * piby2_2tail + subsd %xmm7,%xmm5 # t-rhead + subsd %xmm5,%xmm10 # (rtail-(t-rhead)) + addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead)); + +#r = rhead - rtail +#rr = (rhead-r) -rtail + mov %r8d,region1(%rsp) # store lower region + +# movsd %xmm7,%xmm1 +# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail) +# subsd %xmm1,%xmm7 ; rr=rhead-r +# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail) + + subsd %xmm10,%xmm7 # xmm10 = r=(rhead-rtail) + +# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r +# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr + + movlpd %xmm7,r1(%rsp) # store upper r + +#Work on Upper arg +#Note that volatiles will be trashed by the call +#We do not care since this is the last check +#We will construct r, rr, region and sign + mov 
$0x07ff0000000000000,%r11 # is upper arg nan/inf + mov %r11,%r10 + and %r9,%r10 + cmp %r11,%r10 + jz .L__vrs4_sinf_upper_naninf_higher + + lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf + lea r1+8(%rsp),%rsi + +# changed input from xmm10 to xmm0 + mov r1+8(%rsp),%rdi + + call __remainder_piby2d2f@PLT + + jmp 0f + +.L__vrs4_sinf_upper_naninf_higher: + mov $0x00008000000000000,%r11 + or %r11,%r9 + mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000 + mov %r10d,region1+4(%rsp) # region =0 + +.align 16 +0: + jmp .L__vrs4_sinf_reconstruct + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrs4_sinf_reconstruct: +#Results +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd r(%rsp),%xmm10 + movapd r1(%rsp),%xmm1 + + mov region(%rsp),%r8 + mov region1(%rsp),%r9 + mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path + + mov %r8,%r10 + mov %r9,%r11 + + and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin + and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin + + shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region + shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region + + mov %r10,%rax + mov %r11,%rcx + + not %r12 #ADDED TO CHANGE THE LOGIC + not %r13 #ADDED TO CHANGE THE LOGIC + and %r12,%r10 + and %r13,%r11 + + not %rax + not %rcx + not %r12 + not %r13 + and %r12,%rax + and %r13,%rcx + + or %rax,%r10 + or %rcx,%r11 + and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1 + and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1 + + mov %r10,%r12 + mov %r11,%r13 + + and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit + and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit + + shl $63,%r10 #shift lower sign bit left by 63 bits + shl $63,%r11 #shift lower sign bit left by 63 bits + shl $31,%r12 #shift upper sign bit left by 31 bits + shl $31,%r13 #shift upper sign bit left by 31 bits + + mov %r10,p_sign(%rsp) #write out lower sign bit + mov %r12,p_sign+8(%rsp) #write out upper sign bit + mov %r11,p_sign1(%rsp) #write out lower sign bit + mov %r13,p_sign1+8(%rsp) #write out upper sign bit + + mov %r8,%rax + mov %r9,%rcx + + movapd %xmm10,%xmm2 + movapd %xmm1,%xmm3 + + mulpd %xmm10,%xmm2 # r2 + mulpd %xmm1,%xmm3 # r2 + + and .L__reald_zero_one(%rip),%rax + and .L__reald_zero_one(%rip),%rcx + shr $31,%r8 + shr $31,%r9 + or %r8,%rax + or %r9,%rcx + shl $2,%rcx + or %rcx,%rax + + + lea .Levensin_oddcos_tbl(%rip),%rcx + jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.L__vrsa_sinf_cleanup: + + movapd p_sign(%rsp),%xmm10 + movapd p_sign1(%rsp),%xmm1 + + xorpd %xmm4,%xmm10 # (+) Sign + xorpd %xmm5,%xmm1 # (+) Sign + + cvtpd2ps %xmm10,%xmm0 + cvtpd2ps %xmm1,%xmm11 + movlhps %xmm11,%xmm0 + +# NEW + +.L__vrsa_bottom1: +# store the result _m128d + mov save_ya(%rsp),%rdi # get y_array pointer + movlps %xmm0,(%rdi) + movhps %xmm0,8(%rdi) + + prefetch 32(%rdi) + add $16,%rdi + mov %rdi,save_ya(%rsp) # save y_array pointer + + mov p_iter(%rsp),%rax # get number of iterations + sub 
$1,%rax + mov %rax,p_iter(%rsp) # save number of iterations + jnz .L__vrsa_top + +# see if we need to do any extras + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax + jnz .L__vrsa_cleanup + +.L__final_check: + +# NEW + + mov save_r12(%rsp),%r12 # restore r12 + mov save_r13(%rsp),%r13 # restore r13 + + add $0x0228,%rsp + ret + +#NEW + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# we jump here when we have an odd number of cos calls to make at the end +# we assume that rdx is pointing at the next x array element, r8 at the next y array element. +# The number of values left is in save_nv + +.align 16 +.L__vrsa_cleanup: + mov save_nv(%rsp),%rax # get number of values + test %rax,%rax # are there any values + jz .L__final_check # exit if not + + mov save_xa(%rsp),%rsi + mov save_ya(%rsp),%rdi + + +# START WORKING FROM HERE +# fill in a m128d with zeroes and the extra values and then make a recursive call. + xorps %xmm0,%xmm0 + movss %xmm0,p_temp+4(%rsp) + movlps %xmm0,p_temp+8(%rsp) + + + mov (%rsi),%ecx # we know there's at least one + mov %ecx,p_temp(%rsp) + cmp $2,%rax + jl .L__vrsacg + + mov 4(%rsi),%ecx # do the second value + mov %ecx,p_temp+4(%rsp) + cmp $3,%rax + jl .L__vrsacg + + mov 8(%rsi),%ecx # do the third value + mov %ecx,p_temp+8(%rsp) + +.L__vrsacg: + mov $4,%rdi # parameter for N + lea p_temp(%rsp),%rsi # &x parameter + lea p_temp2(%rsp),%rdx # &y parameter + call vrsa_sinf@PLT # call recursively to compute four values + +# now copy the results to the destination array + mov save_ya(%rsp),%rdi + mov save_nv(%rsp),%rax # get number of values + + mov p_temp2(%rsp),%ecx + mov %ecx,(%rdi) # we know there's at least one + cmp $2,%rax + jl .L__vrsacgf + + mov p_temp2+4(%rsp),%ecx + mov %ecx,4(%rdi) # do the second value + cmp $3,%rax + jl .L__vrsacgf + + mov p_temp2+8(%rsp),%ecx + mov %ecx,8(%rdi) # do the third value + +.L__vrsacgf: + jmp .L__final_check + +#NEW + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_coscos_piby4: + movapd %xmm2,%xmm0 # r + movapd %xmm3,%xmm11 # r + + movdqa .Lcosarray+0x30(%rip),%xmm4 # 
c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c3*x2 + mulpd %xmm3,%xmm9 # c3*x2 + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r + + mulpd %xmm2,%xmm2 # x4 + mulpd %xmm3,%xmm3 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm2,%xmm4 # x4(c3+x2c4) + mulpd %xmm3,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x4 * zc + + subpd %xmm0,%xmm4 # + t + subpd %xmm11,%xmm5 # + t + + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lcossin_cossin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # s2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3 + addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3 + addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1 + addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm0,%xmm0 # move high x4 for cos term + movhlps %xmm11,%xmm11 # move high x4 for cos term + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term + movsd %xmm3,%xmm7 # move low x2 for x3 for sin term + mulsd %xmm10,%xmm6 # get low x3 for sin term + mulsd %xmm1,%xmm7 # get low x3 for sin term + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm2,%xmm12 # move high r for cos + movhlps %xmm3,%xmm13 # move high r for cos + + movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos + movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm7,%xmm5 # sin *x3 + + mulsd %xmm0,%xmm8 # cos *x4 + mulsd %xmm11,%xmm9 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0 + + addsd %xmm10,%xmm4 # sin + x + addsd %xmm1,%xmm5 # sin + x + subsd %xmm12,%xmm8 # cos+t + subsd %xmm13,%xmm9 # cos+t + + movlhps %xmm8,%xmm4 + movlhps %xmm9,%xmm5 + + jmp .L__vrsa_sinf_cleanup +.align 16 +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.Lsincos_cossin_piby4: + + movapd .Lsincosarray+0x30(%rip),%xmm4 # s4 + movapd 
.Lcossinarray+0x30(%rip),%xmm5 # s4 + movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2 + movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm3,%xmm7 # sincos term upper x2 for x3 + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4 + addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4 + addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2 + addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm0,%xmm0 # move high x4 for cos term + + movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin) + mulpd %xmm1,%xmm7 + + mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin) + movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos) + + mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + + movhlps %xmm2,%xmm12 # move high r for cos (cossin) + + + movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin) + movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos) + + mulsd %xmm6,%xmm4 # sin *x3 + mulsd %xmm11,%xmm5 # cos *x4 + mulsd %xmm0,%xmm8 # cos *x4 + mulsd %xmm7,%xmm9 # sin *x3 + + subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0 + + movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos) + + addsd %xmm10,%xmm4 # sin + x + + addsd %xmm11,%xmm9 # sin + x + + + subsd %xmm12,%xmm8 # cos+t + subsd %xmm3,%xmm5 # cos+t + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lsincos_sincos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd .Lcossinarray+0x30(%rip),%xmm4 # s4 + movapd .Lcossinarray+0x30(%rip),%xmm5 # s4 + movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2 + movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm2,%xmm6 # move x2 for x4 + movapd %xmm3,%xmm7 # move x2 for x4 + + mulpd %xmm2,%xmm4 # x2s6 + mulpd %xmm3,%xmm5 # x2s6 + mulpd %xmm2,%xmm8 # x2s3 + mulpd %xmm3,%xmm9 # x2s3 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3 + addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3 + addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1 + addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1 + + mulpd %xmm0,%xmm4 # x4(s4+x2s3) + mulpd %xmm11,%xmm5 # x4(s4+x2s3) + + mulpd %xmm10,%xmm6 # get low x3 for sin term + mulpd %xmm1,%xmm7 # get low x3 for sin term + movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term + movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term + + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms + mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + movhlps %xmm4,%xmm12 # xmm8 = sin , xmm4 = cos + movhlps %xmm5,%xmm13 # xmm9 = sin , xmm5 = cos + + mulsd %xmm6,%xmm12 # sin *x3 + mulsd %xmm7,%xmm13 # sin *x3 + mulsd %xmm0,%xmm4 # cos *x4 + mulsd %xmm11,%xmm5 # cos *x4 + + subsd 
.L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0 + + movhlps %xmm10,%xmm0 # move high x for x for sin term + movhlps %xmm1,%xmm11 # move high x for x for sin term + # Reverse 10 and 0 + + addsd %xmm0,%xmm12 # sin + x + addsd %xmm11,%xmm13 # sin + x + + subsd %xmm2,%xmm4 # cos+t + subsd %xmm3,%xmm5 # cos+t + + movlhps %xmm12,%xmm4 + movlhps %xmm13,%xmm5 + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lcossin_sincos_piby4: + + movapd .Lcossinarray+0x30(%rip),%xmm4 # s4 + movapd .Lsincosarray+0x30(%rip),%xmm5 # s4 + movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2 + movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2 + + movapd %xmm2,%xmm0 # move x2 for x4 + movapd %xmm3,%xmm11 # move x2 for x4 + movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos) + + mulpd %xmm2,%xmm4 # x2s4 + mulpd %xmm3,%xmm5 # x2s4 + mulpd %xmm2,%xmm8 # x2s2 + mulpd %xmm3,%xmm9 # x2s2 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4 + addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4 + addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2 + addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2 + + mulpd %xmm0,%xmm4 # x4(s3+x2s4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + movhlps %xmm11,%xmm11 # move high x4 for cos term + + movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin) + mulpd %xmm10,%xmm7 + + mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin) + movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos) + + mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term + mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term + + addpd %xmm8,%xmm4 # z + addpd %xmm9,%xmm5 # z + + + movhlps %xmm3,%xmm12 # move high r for cos (cossin) + + + movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos) + movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin) + + mulsd %xmm0,%xmm4 # cos *x4 + mulsd %xmm6,%xmm5 # sin *x3 + mulsd %xmm7,%xmm8 # sin *x3 + mulsd %xmm11,%xmm9 # cos *x4 + + subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0 + subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0 + + movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos) + + subsd %xmm2,%xmm4 # cos-(-t) + subsd %xmm12,%xmm9 # cos-(-t) + + addsd %xmm11,%xmm8 # sin + x + addsd %xmm1,%xmm5 # sin + x + + movlhps %xmm8,%xmm4 # cossin + movlhps %xmm9,%xmm5 # sincos + + jmp .L__vrsa_sinf_cleanup +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: SIN +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: COS +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm0 # x2 ; SIN + movapd %xmm3,%xmm11 # x2 ; COS + movapd %xmm3,%xmm1 # copy of x2 for x4 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm0 # x4 + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm3,%xmm1 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + + mulpd %xmm0,%xmm4 # x4(c3+x2c4) + 
mulpd %xmm1,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm1,%xmm5 # x4 * zc + + addpd %xmm10,%xmm4 # +x + subpd %xmm11,%xmm5 # +t + + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lsinsin_coscos_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: COS +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: SIN +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + movapd %xmm2,%xmm0 # x2 ; COS + movapd %xmm3,%xmm11 # x2 ; SIN + movapd %xmm2,%xmm10 # copy of x2 for x4 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # s4 + movapd .Lcosarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # s2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # s4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # s2*x2 + + mulpd %xmm2,%xmm10 # x4 + mulpd %xmm3,%xmm11 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4 + addpd .Lcosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # s1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm10,%xmm4 # x4(c3+x2c4) + mulpd %xmm11,%xmm5 # x4(s3+x2s4) + + subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 + addpd %xmm8,%xmm4 # zc + addpd %xmm9,%xmm5 # zs + + mulpd %xmm10,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # x3 * zc + + subpd %xmm0,%xmm4 # +t + addpd %xmm1,%xmm5 # +x + + jmp .L__vrsa_sinf_cleanup +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +.align 16 +.Lcoscos_cossin_piby4: #Derive from cossin_coscos + movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos + movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin + movapd %xmm3,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # c2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcosarray(%rip),%xmm9 # c2+x2c1 + + + movapd %xmm12,%xmm2 # upper=x4 + movsd %xmm6,%xmm2 # lower=x2 + mulsd %xmm10,%xmm2 # lower=x2*x + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # upper= x4 * zc + # lower=x3 * zs + mulpd %xmm13,%xmm5 # x4 * zc + + + movlhps %xmm7,%xmm10 # + addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos + subpd %xmm11,%xmm5 # -(-t) + + jmp .L__vrsa_sinf_cleanup +.align 16 +.Lcoscos_sincos_piby4: #Derive from cossin_coscos + movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos + movapd %xmm3,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lcosarray+0x30(%rip),%xmm5 # c4 + movapd 
.Lcossinarray+0x10(%rip),%xmm8 # cs2 + movapd .Lcosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcosarray(%rip),%xmm9 # c2+x2c1 + + mulpd %xmm10,%xmm2 # upper=x3 for sin + mulsd %xmm10,%xmm2 # lower=x4 for cos + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm2,%xmm4 # lower= x4 * zc + # upper= x3 * zs + mulpd %xmm13,%xmm5 # x4 * zc + + + movsd %xmm7,%xmm10 + addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos + subpd %xmm11,%xmm5 # -(-t) + + jmp .L__vrsa_sinf_cleanup +.align 16 +.Lcossin_coscos_piby4: + movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos + movapd %xmm2,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd %xmm3,%xmm6 # lower x2 for x3 for sin + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4 + movapd .Lcosarray+0x10(%rip),%xmm8 # cs2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # c2 + + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1 + + movapd %xmm13,%xmm3 # upper=x4 + movsd %xmm6,%xmm3 # lower x2 + mulsd %xmm1,%xmm3 # lower x2*x + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm12,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # upper= x4 * zc + # lower=x3 * zs + + movlhps %xmm7,%xmm1 + addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos + subpd %xmm11,%xmm4 # -(-t) + + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lcossin_sinsin_piby4: # Derived from sincos_coscos + + movhlps %xmm3,%xmm0 # x2 + movapd %xmm3,%xmm7 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsincosarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + movapd %xmm13,%xmm3 # upper x4 for cos + movsd %xmm7,%xmm3 # lower x2 for sin + mulsd 
%xmm1,%xmm3 # lower x3=x2*x for sin + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # upper=x4 * zc + # lower=x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm1,%xmm5 # +t upper, +x lower + + + jmp .L__vrsa_sinf_cleanup +.align 16 +.Lsincos_coscos_piby4: + movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos + movapd %xmm2,%xmm11 # x2 for 0.5x2 + movapd %xmm2,%xmm12 # x2 for x4 + movapd %xmm3,%xmm13 # x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm7 + + movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4 + movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4 + movapd .Lcosarray+0x10(%rip),%xmm8 # cs2 + movapd .Lcossinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2 + mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + subsd %xmm0,%xmm7 # t=1.0-r for cos + subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 + mulpd %xmm2,%xmm12 # x4 + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3 + addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3 + addpd .Lcosarray(%rip),%xmm8 # c2+x2c1 + addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1 + + mulpd %xmm1,%xmm3 # upper=x3 for sin + mulsd %xmm1,%xmm3 # lower=x4 for cos + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zczs + addpd %xmm9,%xmm5 # zc + + mulpd %xmm12,%xmm4 # x4 * zc + mulpd %xmm3,%xmm5 # lower= x4 * zc + # upper= x3 * zs + + movsd %xmm7,%xmm1 + subpd %xmm11,%xmm4 # -(-t) + addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos + + + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lsincos_sinsin_piby4: # Derived from sincos_coscos + + movsd %xmm3,%xmm0 # x2 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4 + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lcossinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm10,%xmm2 # x3 + mulpd %xmm1,%xmm3 # upper x3 for sin + mulsd %xmm1,%xmm3 # lower x4 for cos + + movhlps %xmm1,%xmm6 + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm6,%xmm11 # upper =t ; lower =x + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 # lower=x4 * zc + # upper=x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm11,%xmm5 # +t lower, +x upper + + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lsinsin_cossin_piby4: # Derived from sincos_coscos + + movhlps %xmm2,%xmm0 # x2 + movapd %xmm2,%xmm7 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + movapd .Lsincosarray+0x10(%rip),%xmm8 # c2 + movapd 
.Lsinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + movapd %xmm12,%xmm2 # upper x4 for cos + movsd %xmm7,%xmm2 # lower x2 for sin + mulsd %xmm10,%xmm2 # lower x3=x2*x for sin + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # upper=x4 * zc + # lower=x3 * zs + + addpd %xmm1,%xmm5 # +x + addpd %xmm10,%xmm4 # +t upper, +x lower + + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lsinsin_sincos_piby4: # Derived from sincos_coscos + + movsd %xmm2,%xmm0 # x2 + movapd %xmm2,%xmm12 # copy of x2 for x4 + movapd %xmm3,%xmm13 # copy of x2 for x4 + movsd .L__real_3ff0000000000000(%rip),%xmm11 + + movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + movapd .Lcossinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + mulpd %xmm2,%xmm12 # x4 + subsd %xmm0,%xmm11 # t=1.0-r for cos + mulpd %xmm3,%xmm13 # x4 + + addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm1,%xmm3 # x3 + mulpd %xmm10,%xmm2 # upper x3 for sin + mulsd %xmm10,%xmm2 # lower x4 for cos + + movhlps %xmm10,%xmm6 + + mulpd %xmm12,%xmm4 # x4(c3+x2c4) + mulpd %xmm13,%xmm5 # x4(c3+x2c4) + + movlhps %xmm6,%xmm11 + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zszc + + mulpd %xmm3,%xmm5 # x3 * zs + mulpd %xmm2,%xmm4 # lower=x4 * zc + # upper=x3 * zs + + addpd %xmm1,%xmm5 # +x + addpd %xmm11,%xmm4 # +t lower, +x upper + + jmp .L__vrsa_sinf_cleanup + +.align 16 +.Lsinsin_sinsin_piby4: +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; +# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr +# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr +#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + + #x2 = x * x; + #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)))); + + #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4)); + + + movapd %xmm2,%xmm0 # x2 + movapd %xmm3,%xmm11 # x2 + + movdqa .Lsinarray+0x30(%rip),%xmm4 # c4 + movdqa .Lsinarray+0x30(%rip),%xmm5 # c4 + + mulpd %xmm2,%xmm0 # x4 + mulpd %xmm3,%xmm11 # x4 + + movapd .Lsinarray+0x10(%rip),%xmm8 # c2 + movapd .Lsinarray+0x10(%rip),%xmm9 # c2 + + mulpd %xmm2,%xmm4 # c4*x2 + mulpd %xmm3,%xmm5 # c4*x2 + + mulpd %xmm2,%xmm8 # c2*x2 + mulpd %xmm3,%xmm9 # c2*x2 + + addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4 + addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4 + + mulpd %xmm10,%xmm2 # x3 + mulpd %xmm1,%xmm3 # x3 + + addpd .Lsinarray(%rip),%xmm8 # c1+x2c2 + addpd .Lsinarray(%rip),%xmm9 # c1+x2c2 + + mulpd %xmm0,%xmm4 # x4(c3+x2c4) + mulpd %xmm11,%xmm5 # x4(c3+x2c4) + + addpd %xmm8,%xmm4 # zs + addpd %xmm9,%xmm5 # zs + + mulpd %xmm2,%xmm4 # x3 * zs + mulpd %xmm3,%xmm5 
# x3 * zs + + addpd %xmm10,%xmm4 # +x + addpd %xmm1,%xmm5 # +x + + jmp .L__vrsa_sinf_cleanup
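For reference, the rhead/rtail sequences repeated above are a three-constant Cody-Waite reduction of x against pi/2, used on the |x| < 5e5 paths (larger arguments go through __remainder_piby2d2f instead). A C sketch of the same computation; the function name reduce_piby2 is ours, and the constants are the decimal equivalents of the .L__real_* literals the assembly loads:

    #include <math.h>

    static const double twobypi     = 6.36619772367581382433e-01; /* 0x3fe45f306dc9c883 */
    static const double piby2_1     = 1.57079632673412561417e+00; /* 0x3ff921fb54400000 */
    static const double piby2_2     = 6.07710050630396597660e-11; /* 0x3dd0b4611a600000 */
    static const double piby2_2tail = 2.02226624879595063154e-21; /* 0x3ba3198a2e037073 */

    /* Reduce x to r in roughly [-pi/4, pi/4]; region mod 4 selects sin/cos and sign. */
    static double reduce_piby2(double x, int *region)
    {
        double npi2  = trunc(x * twobypi + 0.5);   /* the cvttpd2dq step          */
        double rhead = x - npi2 * piby2_1;         /* leading bits cancel exactly */
        double rtail = npi2 * piby2_2;
        double t     = rhead;
        rhead = t - rtail;
        rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
        *region = (int)npi2;
        return rhead - rtail;                      /* r, with extra precision     */
    }

Three constants carry only so many bits of pi/2, which is why the >= 5e5 paths above fall back to the full remainder routine.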
diff --git a/src/hypot.c b/src/hypot.c new file mode 100644 index 0000000..063d526 --- /dev/null +++ b/src/hypot.c
@@ -0,0 +1,223 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_SCALEDOUBLE_1 +#define USE_INFINITY_WITH_FLAGS +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_SCALEDOUBLE_1 +#undef USE_INFINITY_WITH_FLAGS +#undef USE_HANDLE_ERROR + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range result */ +static inline double retval_errno_erange_overflow(double x, double y) +{ + struct exception exc; + exc.arg1 = x; + exc.arg2 = y; + exc.type = OVERFLOW; + exc.name = (char *)"hypot"; + if (_LIB_VERSION == _SVID_) + exc.retval = HUGE; + else + exc.retval = infinity_with_flags(AMD_F_OVERFLOW | AMD_F_INEXACT); + if (_LIB_VERSION == _POSIX_) + __set_errno(ERANGE); + else if (!matherr(&exc)) + __set_errno(ERANGE); + return exc.retval; +} +#endif + +#ifdef WINDOWS +double FN_PROTOTYPE(hypot)(double x, double y) +#else +double FN_PROTOTYPE(hypot)(double x, double y) +#endif +{ + /* Returns sqrt(x*x + y*y) with no overflow or underflow unless + the result warrants it */ + + const double large = 1.79769313486231570815e+308; /* 0x7fefffffffffffff */ + + double u, r, retval, hx, tx, x2, hy, ty, y2, hs, ts; + unsigned long long xexp, yexp, ux, uy, ut; + int dexp, expadjust; + + GET_BITS_DP64(x, ux); + ux &= ~SIGNBIT_DP64; + GET_BITS_DP64(y, uy); + uy &= ~SIGNBIT_DP64; + xexp = (ux >> EXPSHIFTBITS_DP64); + yexp = (uy >> EXPSHIFTBITS_DP64); + + if (xexp == BIASEDEMAX_DP64 + 1 || yexp == BIASEDEMAX_DP64 + 1) + { + /* One or both of the arguments are NaN or infinity. The + result will also be NaN or infinity. */ + retval = x*x + y*y; + if (((xexp == BIASEDEMAX_DP64 + 1) && !(ux & MANTBITS_DP64)) || + ((yexp == BIASEDEMAX_DP64 + 1) && !(uy & MANTBITS_DP64))) + /* x or y is infinity. ISO C99 defines that we must + return +infinity, even if the other argument is NaN. + Note that the computation of x*x + y*y above will already + have raised invalid if either x or y is a signalling NaN. */ + return infinity_with_flags(0); + else + /* One or both of x or y is NaN, and neither is infinity. + Raise invalid if it's a signalling NaN */ + return retval; + } + + /* Set x = abs(x) and y = abs(y) */ + PUT_BITS_DP64(ux, x); + PUT_BITS_DP64(uy, y); + + /* The difference in exponents between x and y */ + dexp = (int)(xexp - yexp); + expadjust = 0; + + if (ux == 0) + /* x is zero */ + return y; + else if (uy == 0) + /* y is zero */ + return x; + else if (dexp > MANTLENGTH_DP64 + 1 || dexp < -MANTLENGTH_DP64 - 1) + /* One of x and y is insignificant compared to the other */ + return x + y; /* Raise inexact */ + else if (xexp > EXPBIAS_DP64 + 500 || yexp > EXPBIAS_DP64 + 500) + { + /* Danger of overflow; scale down by 2**600. 
*/ + expadjust = 600; + ux -= 0x2580000000000000; + PUT_BITS_DP64(ux, x); + uy -= 0x2580000000000000; + PUT_BITS_DP64(uy, y); + } + else if (xexp < EXPBIAS_DP64 - 500 || yexp < EXPBIAS_DP64 - 500) + { + /* Danger of underflow; scale up by 2**600. */ + expadjust = -600; + if (xexp == 0) + { + /* x is denormal - handle by adding 601 to the exponent + and then subtracting a correction for the implicit bit */ + PUT_BITS_DP64(ux + 0x2590000000000000, x); + x -= 9.23297861778573578076e-128; /* 0x2590000000000000 */ + GET_BITS_DP64(x, ux); + } + else + { + /* x is normal - just increase the exponent by 600 */ + ux += 0x2580000000000000; + PUT_BITS_DP64(ux, x); + } + if (yexp == 0) + { + PUT_BITS_DP64(uy + 0x2590000000000000, y); + y -= 9.23297861778573578076e-128; /* 0x2590000000000000 */ + GET_BITS_DP64(y, uy); + } + else + { + uy += 0x2580000000000000; + PUT_BITS_DP64(uy, y); + } + } + + +#ifdef FAST_BUT_GREATER_THAN_ONE_ULP + /* Not awful, but results in accuracy loss larger than 1 ulp */ + r = x*x + y*y; +#else + /* Slower but more accurate */ + + /* Sort so that x is greater than y */ + if (x < y) + { + u = y; + y = x; + x = u; + ut = ux; + ux = uy; + uy = ut; + } + + /* Split x into hx and tx, head and tail */ + PUT_BITS_DP64(ux & 0xfffffffff8000000, hx); + tx = x - hx; + + PUT_BITS_DP64(uy & 0xfffffffff8000000, hy); + ty = y - hy; + + /* Compute r = x*x + y*y with extra precision */ + x2 = x*x; + y2 = y*y; + hs = x2 + y2; + + if (dexp == 0) + /* We take most care when x and y have equal exponents, + i.e. are almost the same size */ + ts = (((x2 - hs) + y2) + + ((hx * hx - x2) + 2 * hx * tx) + tx * tx) + + ((hy * hy - y2) + 2 * hy * ty) + ty * ty; + else + ts = (((x2 - hs) + y2) + + ((hx * hx - x2) + 2 * hx * tx) + tx * tx); + + r = hs + ts; +#endif + + /* The sqrt can introduce another half ulp error. */ +#ifdef WINDOWS + /* VC++ intrinsic call */ + _mm_store_sd(&retval, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r))); +#else + /* Hammer sqrt instruction */ + asm volatile ("sqrtsd %1, %0" : "=x" (retval) : "x" (r)); +#endif + + /* If necessary scale the result back. This may lead to + overflow but if so that's the correct result. */ + retval = scaleDouble_1(retval, expadjust); + + if (retval > large) + /* The result overflowed. Deal with errno. */ +#ifdef WINDOWS + return handle_error("hypot", PINFBITPATT_DP64, _OVERFLOW, + AMD_F_OVERFLOW | AMD_F_INEXACT, ERANGE, x, y); +#else + return retval_errno_erange_overflow(x, y); +#endif + + return retval; +} + +weak_alias (__hypot, hypot)
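The "split x into hx and tx" step above works because masking off the low 27 mantissa bits leaves a head with at most 26 significant bits, so hx*hx is exact in double precision. A standalone demonstration of the trick (hypothetical, not library code):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        double x = 1.2345678901234567, hx, tx;
        unsigned long long ux;

        memcpy(&ux, &x, sizeof ux);      /* the GET_BITS_DP64 idea, portably       */
        ux &= 0xfffffffff8000000ULL;     /* keep sign, exponent, top mantissa bits */
        memcpy(&hx, &ux, sizeof hx);
        tx = x - hx;                     /* exact: the tail is the discarded bits  */

        /* hx*hx and 2*hx*tx are exact and tx*tx is tiny, so x*x is
           recovered to roughly twice working precision: */
        printf("%.17g + %.17g\n", hx * hx, 2 * hx * tx + tx * tx);
        return 0;
    }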
diff --git a/src/hypotf.c b/src/hypotf.c new file mode 100644 index 0000000..fcc09fc --- /dev/null +++ b/src/hypotf.c
@@ -0,0 +1,131 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#ifdef USE_SOFTWARE_SQRT +#define USE_SQRTF_AMD_INLINE +#endif +#define USE_INFINITYF_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#ifdef USE_SOFTWARE_SQRT +#undef USE_SQRTF_AMD_INLINE +#endif +#undef USE_INFINITYF_WITH_FLAGS +#undef USE_HANDLE_ERRORF + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range result */ +static inline float retval_errno_erange_overflow(float x, float y) +{ + struct exception exc; + exc.arg1 = (double)x; + exc.arg2 = (double)y; + exc.type = OVERFLOW; + exc.name = (char *)"hypotf"; + if (_LIB_VERSION == _SVID_) + exc.retval = HUGE; + else + exc.retval = infinityf_with_flags(AMD_F_OVERFLOW | AMD_F_INEXACT); + if (_LIB_VERSION == _POSIX_) + __set_errno(ERANGE); + else if (!matherr(&exc)) + __set_errno(ERANGE); + return exc.retval; +} +#endif + +#ifdef WINDOWS +float FN_PROTOTYPE(hypotf)(float x, float y) +#else +float FN_PROTOTYPE(hypotf)(float x, float y) +#endif +{ + /* Returns sqrt(x*x + y*y) with no overflow or underflow unless + the result warrants it */ + + /* Do intermediate computations in double precision + and use sqrt instruction from chip if available. */ + double dx = x, dy = y, dr, retval; + + /* The largest finite float, stored as a double */ + const double large = 3.40282346638528859812e+38; /* 0x47efffffe0000000 */ + + + unsigned long long ux, uy, avx, avy; + + GET_BITS_DP64(x, avx); + avx &= ~SIGNBIT_DP64; + GET_BITS_DP64(y, avy); + avy &= ~SIGNBIT_DP64; + ux = (avx >> EXPSHIFTBITS_DP64); + uy = (avy >> EXPSHIFTBITS_DP64); + + if (ux == BIASEDEMAX_DP64 + 1 || uy == BIASEDEMAX_DP64 + 1) + { + retval = x*x + y*y; + /* One or both of the arguments are NaN or infinity. The + result will also be NaN or infinity. */ + if (((ux == BIASEDEMAX_DP64 + 1) && !(avx & MANTBITS_DP64)) || + ((uy == BIASEDEMAX_DP64 + 1) && !(avy & MANTBITS_DP64))) + /* x or y is infinity. ISO C99 defines that we must + return +infinity, even if the other argument is NaN. + Note that the computation of x*x + y*y above will already + have raised invalid if either x or y is a signalling NaN. */ + return infinityf_with_flags(0); + else + /* One or both of x or y is NaN, and neither is infinity. 
+ Raise invalid if it's a signalling NaN */ + return (float)retval; + } + + dr = (dx*dx + dy*dy); + +#ifdef USE_SOFTWARE_SQRT + retval = sqrtf_amd_inline(dr); +#else +#ifdef WINDOWS + /* VC++ intrinsic call */ + _mm_store_sd(&retval, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&dr))); +#else + /* Hammer sqrt instruction */ + asm volatile ("sqrtsd %1, %0" : "=x" (retval) : "x" (dr)); +#endif +#endif + + if (retval > large) +#ifdef WINDOWS + return handle_errorf("hypotf", PINFBITPATT_SP32, _OVERFLOW, + AMD_F_OVERFLOW | AMD_F_INEXACT, ERANGE, x, y); +#else + return retval_errno_erange_overflow(x, y); +#endif + else + return (float)retval; + } + +weak_alias (__hypotf, hypotf)
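hypotf can get away with the plain dr = dx*dx + dy*dy because the intermediates are doubles: a float mantissa has 24 bits, so each square is exact in double (24 + 24 = 48 <= 53 bits) and only the final add and the sqrt round. A minimal sketch of that idea, with none of the special-case, overflow, or errno handling above:

    #include <math.h>

    /* Illustrative only; hypotf_sketch is not the library routine. */
    static float hypotf_sketch(float x, float y)
    {
        double dx = x, dy = y;                  /* exact widening          */
        return (float)sqrt(dx * dx + dy * dy);  /* squares exact in double */
    }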
diff --git a/src/ilogb.c b/src/ilogb.c new file mode 100644 index 0000000..2c1cb7c --- /dev/null +++ b/src/ilogb.c
@@ -0,0 +1,99 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifdef WIN64 +#include <fpieee.h> +#else +#include <errno.h> +#endif + +#include <math.h> +#include <limits.h> +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#include "../inc/libm_special.h" + + +int FN_PROTOTYPE(ilogb)(double x) +{ + + + /* Check for input range */ + UT64 checkbits; + int expbits; + U64 manbits; + U64 zerovalue; + /* Clear the sign bit and check if the value is zero, nan or inf. */ + checkbits.f64=x; + zerovalue = (checkbits.u64 & ~SIGNBIT_DP64); + + if(zerovalue == 0) + { + /* Raise exception as the number is zero */ + __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MIN); + + + return INT_MIN; + } + + if( zerovalue == EXPBITS_DP64 ) + { + /* Raise exception as the number is inf */ + + __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MAX); + + return INT_MAX; + } + + if( zerovalue > EXPBITS_DP64 ) + { + /* Raise exception as the number is nan */ + __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MIN); + + + return INT_MIN; + } + + expbits = (int) (( checkbits.u64 << 1) >> 53); + + if(expbits == 0 && (checkbits.u64 & MANTBITS_DP64 )!= 0) + { + /* the value is denormalized */ + manbits = checkbits.u64 & MANTBITS_DP64; + expbits = EMIN_DP64; + while (manbits < IMPBIT_DP64) + { + manbits <<= 1; + expbits--; + } + } + else + { + + expbits-=EXPBIAS_DP64; + } + + + return expbits; +}
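A few spot checks of the contract the routine above implements (standard C99 ilogb semantics assumed):

    #include <math.h>
    #include <limits.h>
    #include <assert.h>

    int main(void)
    {
        assert(ilogb(1.0)    == 0);
        assert(ilogb(32.0)   == 5);       /* 32 = 1.0 * 2^5                    */
        assert(ilogb(0.375)  == -2);      /* 0.375 = 1.5 * 2^-2                */
        assert(ilogb(5e-324) == -1074);   /* smallest denormal; the while loop
                                             above walks the mantissa up       */
        assert(ilogb(0.0)    == INT_MIN); /* domain-error path                 */
        return 0;
    }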
diff --git a/src/ilogbf.c b/src/ilogbf.c new file mode 100644 index 0000000..cb129e6 --- /dev/null +++ b/src/ilogbf.c
@@ -0,0 +1,109 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifdef WIN64 +#include <fpieee.h> +#else +#include <errno.h> +#endif + +#include <math.h> +#include <limits.h> +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#include "../inc/libm_special.h" + +int FN_PROTOTYPE(ilogbf)(float x) +{ + + /* Check for input range */ + UT32 checkbits; + int expbits; + U32 manbits; + U32 zerovalue; + checkbits.f32=x; + + /* Clear the sign bit and check if the value is zero, nan or inf. */ + zerovalue = (checkbits.u32 & ~SIGNBIT_SP32); + + if(zerovalue == 0) + { + /* Raise exception as the number is zero */ + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MIN, 0); + } + + return INT_MIN; + } + + if( zerovalue == EXPBITS_SP32 ) + { + /* Raise exception as the number is inf */ + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MAX, 0); + } + + return INT_MAX; + } + + if( zerovalue > EXPBITS_SP32 ) + { + /* Raise exception as the number is nan */ + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MIN, 0); + } + + return INT_MIN; + } + + expbits = (int) (( checkbits.u32 << 1) >> 24); + + if(expbits == 0 && (checkbits.u32 & MANTBITS_SP32 )!= 0) + { + /* the value is denormalized */ + manbits = checkbits.u32 & MANTBITS_SP32; + expbits = EMIN_SP32; + while (manbits < IMPBIT_SP32) + { + manbits <<= 1; + expbits--; + } + } + else + { + expbits-=EXPBIAS_SP32; + } + + + return expbits; +}
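The same checks for the float variant; the smallest single-precision denormal is 2^-149, so the normalization loop must report -149:

    #include <math.h>
    #include <assert.h>

    int main(void)
    {
        assert(ilogbf(1.0f)     == 0);
        assert(ilogbf(0.25f)    == -2);
        assert(ilogbf(1.4e-45f) == -149);  /* 2^-149, smallest float denormal */
        return 0;
    }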
diff --git a/src/ldexp.c b/src/ldexp.c new file mode 100644 index 0000000..695118b --- /dev/null +++ b/src/ldexp.c
@@ -0,0 +1,117 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifdef WIN64 +#include <fpieee.h> +#else +#include <errno.h> +#endif + +#include <math.h> +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#include "../inc/libm_special.h" + +double FN_PROTOTYPE(ldexp)(double x, int n) +{ + UT64 val; + unsigned int sign; + int exponent; + val.f64 = x; + sign = val.u32[1] & 0x80000000; + val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */ + + if((val.u32[1] & 0x7ff00000)== 0x7ff00000)/* x= nan or x = +-inf*/ + return x; + + if((val.u64 == 0x0000000000000000) || (n==0)) + return x; /* x= +-0 or n= 0*/ + + exponent = val.u32[1] >> 20; /* get the exponent */ + + if(exponent == 0)/*x is denormal*/ + { + val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/ + exponent = val.u32[1] >> 20; /* get the exponent */ + exponent = exponent + n - MULTIPLIER_DP; + if(exponent < -MULTIPLIER_DP)/*underflow*/ + { + val.u32[1] = sign | 0x00000000; + val.u32[0] = 0x00000000; + __amd_handle_error(UNDERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64); + + return val.f64; + } + if(exponent > 2046)/*overflow*/ + { + val.u32[1] = sign | 0x7ff00000; + val.u32[0] = 0x00000000; + __amd_handle_error(OVERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64); + + return val.f64; + } + + exponent += MULTIPLIER_DP; + val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff); + val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP; + return val.f64; + } + + exponent += n; + + if(exponent < -MULTIPLIER_DP)/*underflow*/ + { + val.u32[1] = sign | 0x00000000; + val.u32[0] = 0x00000000; + + __amd_handle_error(UNDERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64); + + + return val.f64; + } + + if(exponent < 1)/*x is normal but output is denormal*/ + { + exponent += MULTIPLIER_DP; + val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff); + val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP; + return val.f64; + } + + if(exponent > 2046)/*overflow*/ + { + val.u32[1] = sign | 0x7ff00000; + val.u32[0] = 0x00000000; + __amd_handle_error(OVERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64); + + + return val.f64; + } + + val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff); + return val.f64; +} + +
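Spot checks covering the paths above: normal scaling, a denormal input normalized through the 2^53 multiplier, a denormal result, and overflow (C99 ldexp semantics assumed):

    #include <math.h>
    #include <assert.h>

    int main(void)
    {
        assert(ldexp(1.5, 4)       == 24.0);      /* normal in, normal out        */
        assert(ldexp(5e-324, 1074) == 1.0);       /* denormal input, scaled up    */
        assert(ldexp(1.0, -1074)   == 5e-324);    /* denormal output path         */
        assert(ldexp(1.0, 2000)    == HUGE_VAL);  /* overflow: +inf, errno ERANGE */
        return 0;
    }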
diff --git a/src/ldexpf.c b/src/ldexpf.c new file mode 100644 index 0000000..892c6e9 --- /dev/null +++ b/src/ldexpf.c
@@ -0,0 +1,133 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifdef WIN64 +#include <fpieee.h> +#endif + +#include <math.h> +#include <errno.h> +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#include "../inc/libm_special.h" + + +float FN_PROTOTYPE(ldexpf)(float x, int n) +{ + UT32 val; + unsigned int sign; + int exponent; + val.f32 = x; + sign = val.u32 & 0x80000000; + val.u32 = val.u32 & 0x7fffffff;/* remove the sign bit */ + + if((val.u32 & 0x7f800000)== 0x7f800000)/* x= nan or x = +-inf*/ + return x; + + if((val.u32 == 0x00000000) || (n==0))/* x= +-0 or n= 0*/ + return x; + + exponent = val.u32 >> 23; /* get the exponent */ + + if(exponent == 0)/*x is denormal*/ + { + val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/ + exponent = (val.u32 >> 23); /* get the exponent */ + exponent = exponent + n - MULTIPLIER_SP; + if(exponent < -MULTIPLIER_SP)/*underflow*/ + { + val.u32 = sign | 0x00000000; + + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(UNDERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0); + } + + return val.f32; + } + if(exponent > 254)/*overflow*/ + { + val.u32 = sign | 0x7f800000; + + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(OVERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0); + } + + + return val.f32; + } + + exponent += MULTIPLIER_SP; + val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff); + val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP; + return val.f32; + } + + exponent += n; + + if(exponent < -MULTIPLIER_SP)/*underflow*/ + { + val.u32 = sign | 0x00000000; + + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(UNDERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0); + } + + return val.f32; + } + + if(exponent < 1)/*x is normal but output is denormal*/ + { + exponent += MULTIPLIER_SP; + val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff); + val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP; + return val.f32; + } + + if(exponent > 254)/*overflow*/ + { + val.u32 = sign | 0x7f800000; + + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(OVERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0); + } + + return val.f32; + } + + val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);/*x is normal and output is normal*/ + return val.f32; +}
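And for the single-precision version, whose denormal path scales by 2^24 instead:

    #include <math.h>
    #include <assert.h>

    int main(void)
    {
        assert(ldexpf(1.5f, 4)       == 24.0f);
        assert(ldexpf(1.4e-45f, 149) == 1.0f);       /* denormal scaled up via 2^24  */
        assert(ldexpf(1.0f, -149)    == 1.4e-45f);   /* denormal result              */
        assert(ldexpf(1.0f, 300)     == HUGE_VALF);  /* overflow: +inf, errno ERANGE */
        return 0;
    }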
diff --git a/src/libm_special.c b/src/libm_special.c new file mode 100644 index 0000000..974d99b --- /dev/null +++ b/src/libm_special.c
@@ -0,0 +1,117 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + +#ifdef __x86_64__ + +#include <emmintrin.h> +#include <math.h> +#include <errno.h> + +#include "../inc/libm_util_amd.h" +#include "../inc/libm_special.h" + +#ifdef WIN64 +#define EXCEPTION_S _exception +#else +#define EXCEPTION_S exception +#endif + + + +static double convert_snan_32to64(float x) +{ + U64 t; + UT32 xs; + UT64 xb; + + xs.f32 = x; + xb.u64 = (((xs.u32 & SIGNBIT_SP32) == SIGNBIT_SP32) ? NINFBITPATT_DP64 : EXPBITS_DP64); + + t = 0; + t = (xs.u32 & MANTBITS_SP32); + t = (t << 29); // 29 = (52-23) + xb.u64 = (xb.u64 | t); + + return xb.f64; +} + +#ifdef NEED_FAKE_MATHERR +int +matherr (struct exception *s) +{ + return 0; +} +#endif + +void __amd_handle_errorf(int type, int error, const char *name, + float arg1, unsigned int arg1_is_snan, + float arg2, unsigned int arg2_is_snan, + float retval, unsigned int retval_is_snan) +{ + struct EXCEPTION_S exception_data; + + // write exception info + exception_data.type = type; + exception_data.name = (char*)name; + + // sNaN float to double conversion can trigger interrupt + // handle them specially + + if(arg1_is_snan) { exception_data.arg1 = convert_snan_32to64(arg1); } + else { exception_data.arg1 = (double)arg1; } + + if(arg2_is_snan) { exception_data.arg2 = convert_snan_32to64(arg2); } + else { exception_data.arg2 = (double)arg2; } + + if(retval_is_snan) { exception_data.retval = convert_snan_32to64(retval); } + else { exception_data.retval = (double)retval; } + + // call matherr, set errno if matherr returns 0 + if(!matherr(&exception_data)) + { + errno = error; + } +} + +void __amd_handle_error(int type, int error, const char *name, + double arg1, + double arg2, + double retval) +{ + struct EXCEPTION_S exception_data; + + // write exception info + exception_data.type = type; + exception_data.name = (char*)name; + + exception_data.arg1 = arg1; + exception_data.arg2 = arg2; + exception_data.retval = retval; + + // call matherr, set errno if matherr returns 0 + if(!matherr(&exception_data)) + { + errno = error; + } +} + +#endif /* __x86_64__ */
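convert_snan_32to64 above moves the 23 float mantissa bits up by 29 (= 52 - 23) so a signalling-NaN payload survives into the double fields of struct exception; a plain (double) cast would quieten the NaN and could raise an invalid-operation exception first, which is exactly what the reporting path must avoid. A standalone sketch of the same payload widening (the function name is illustrative, not from the patch headers):

    #include <stdint.h>
    #include <stdio.h>

    /* Re-home a binary32 NaN payload in the binary64 mantissa field. */
    static uint64_t widen_nan_payload(uint32_t bits32)
    {
        uint64_t sign    = (uint64_t)(bits32 & 0x80000000u) << 32;
        uint64_t payload = (uint64_t)(bits32 & 0x007fffffu) << 29;  /* 52 - 23 */
        return sign | 0x7ff0000000000000ull | payload;  /* all-ones exponent */
    }

    int main(void)
    {
        /* sNaN input: quiet bit (bit 22) clear, payload non-zero */
        printf("%016llx\n", (unsigned long long)widen_nan_payload(0x7f800001u));
        return 0;  /* prints 7ff0000020000000, still a signalling NaN */
    }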
diff --git a/src/llrint.c b/src/llrint.c new file mode 100644 index 0000000..5f96115 --- /dev/null +++ b/src/llrint.c
@@ -0,0 +1,62 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+long long int FN_PROTOTYPE(llrint)(double x)
+{
+
+    UT64 checkbits,val_2p52;
+    checkbits.f64=x;
+
+    /* Mask off the sign bit and check if the value can be rounded */
+
+    if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+    {
+        /* number can't be rounded; raise an exception */
+        /* Number exceeds the representable range; it could also be nan or inf */
+        __amd_handle_error(DOMAIN, EDOM, "llrint", x, 0.0, (double)x);
+
+        return (long long int) x;
+    }
+
+    val_2p52.u32[1] = (checkbits.u32[1] & 0x80000000) | 0x43300000;
+    val_2p52.u32[0] = 0;
+
+
+    /* Add and subtract 2^52 to round the number according to the current rounding direction */
+
+    return (long long int) ((x + val_2p52.f64) - val_2p52.f64);
+}
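The add-and-subtract of 2^52 is what lets llrint honor the current rounding mode without inspecting the fraction bits: once |x| is folded onto [2^52, 2^53), the binary64 mantissa has no bits below the units place, so the addition itself performs the rounding in whatever mode the FPU is in, and subtracting 2^52 then recovers the rounded integer exactly. Copying the sign bit into val_2p52 keeps the trick symmetric for negative inputs. A quick demonstration under the default round-to-nearest-even mode (compile without -ffast-math so the add/subtract pair is not folded away):

    #include <stdio.h>

    int main(void)
    {
        double xs[] = { 2.5, 3.5, -2.5 };
        for (int i = 0; i < 3; i++) {
            double x = xs[i];
            double t = x < 0 ? -0x1p52 : 0x1p52;        /* 2^52 with x's sign */
            printf("%+.1f -> %+.1f\n", x, (x + t) - t); /* 2.5->2, 3.5->4, -2.5->-2 */
        }
        return 0;
    }

Note the ties-to-even results (2.5 rounds to 2): this is rint/llrint behavior, not the half-away-from-zero behavior of llround below.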
diff --git a/src/llrintf.c b/src/llrintf.c new file mode 100644 index 0000000..509e46b --- /dev/null +++ b/src/llrintf.c
@@ -0,0 +1,67 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long long int FN_PROTOTYPE(llrintf)(float x)
+{
+
+    UT32 checkbits,val_2p23;
+    checkbits.f32=x;
+
+    /* Mask off the sign bit and check if the value can be rounded */
+
+    if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+    {
+        /* number can't be rounded; raise an exception */
+        /* Number exceeds the representable range; it could also be nan or inf */
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, EDOM, "llrintf", x, is_x_snan, 0.0F, 0, (float)x, 0);
+        }
+
+        return (long long int) x;
+    }
+
+
+    val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+    /* Add and subtract 2^23 to round the number according to the current rounding direction */
+
+    return (long long int) ((x + val_2p23.f32) - val_2p23.f32);
+}
diff --git a/src/llround.c b/src/llround.c new file mode 100644 index 0000000..0b582c2 --- /dev/null +++ b/src/llround.c
@@ -0,0 +1,112 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+/* In Windows, long long int is 64-bit and long int is 32-bit.
+   In Linux, long long int and long int are both 64-bit. */
+long long int FN_PROTOTYPE(llround)(double d)
+{
+    UT64 u64d;
+    UT64 u64Temp,u64result;
+    int intexp, shift;
+    U64 sign;
+    long long int result;
+
+    u64d.f64 = u64Temp.f64 = d;
+
+    if ((u64d.u32[1] & 0X7FF00000) == 0x7FF00000)
+    {
+        /* the number is nan or infinity; raise a range or domain error */
+        __amd_handle_error(DOMAIN, EDOM, "llround", d, 0.0, (double)SIGNBIT_DP64);
+        return SIGNBIT_DP64; /* GCC returns this when the number is out of range */
+    }
+
+    u64Temp.u32[1] &= 0x7FFFFFFF;
+    intexp = (u64d.u32[1] & 0x7FF00000) >> 20;
+    sign = u64d.u64 & 0x8000000000000000;
+    intexp -= 0x3FF;
+
+    /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+    if (intexp < -1)
+        return (0);
+
+    /* 1.0 x 2^63 is already too large */
+    if (intexp >= 63)
+    {
+        /* Based on the sign of the input value return the MAX and MIN */
+        result = 0x8000000000000000; /* Return LLONG_MIN */
+        __amd_handle_error(DOMAIN, EDOM, "llround", d, 0.0, (double) result);
+
+        return result;
+    }
+
+    u64result.f64 = u64Temp.f64;
+    /* >= 2^52 is already an exact integer */
+    if (intexp < 52)
+    {
+        /* add 0.5, extraction below will truncate */
+        u64result.f64 = u64Temp.f64 + 0.5;
+    }
+
+    intexp = ((u64result.u32[1] >> 20) & 0x7ff) - 0x3FF;
+
+    u64result.u32[1] &= 0xfffff;
+    u64result.u32[1] |= 0x00100000; /* restore the implicit leading mantissa bit */
+    shift = intexp - 52;
+
+    if(shift < 0)
+        u64result.u64 = u64result.u64 >> (-shift);
+    if(shift > 0)
+        u64result.u64 = u64result.u64 << (shift);
+
+    result = u64result.u64;
+
+    if (sign)
+        result = -result;
+
+    return result;
+}
+
+#else //WINDOWS
+/* llround is equivalent to the Linux implementation of lround;
+   long int and long long int are the same size there. */
+long long int FN_PROTOTYPE(llround)(double d)
+{
+    long long int result;
+    result = FN_PROTOTYPE(lround)(d);
+    return result;
+}
+#endif
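Unlike llrint, llround must round halfway cases away from zero regardless of the FPU mode, so it adds 0.5 (only while the value can still carry a fraction, i.e. below 2^52) and then truncates by shifting the mantissa, with the implicit leading 1 restored, until the binary point sits at bit 0. A compact standalone rendering of that extraction (round-half-away-from-zero; |result| < 2^63 and a finite input assumed; the function name is illustrative):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static long long llround_sketch(double d)
    {
        double a = d < 0 ? -d : d;
        uint64_t u;
        memcpy(&u, &a, sizeof u);
        if ((int)((u >> 52) & 0x7ff) - 1023 < 52)
            a += 0.5;                        /* may carry a fraction: round it */
        memcpy(&u, &a, sizeof u);
        int e = (int)((u >> 52) & 0x7ff) - 1023;
        if (e < 0) return 0;                 /* |d| + 0.5 < 1 */
        uint64_t m = (u & 0x000fffffffffffffull) | (1ull << 52);  /* implicit 1 */
        long long r = (long long)(e >= 52 ? m << (e - 52) : m >> (52 - e));
        return d < 0 ? -r : r;
    }

    int main(void)
    {
        printf("%lld %lld\n", llround_sketch(2.5), llround_sketch(-2.5));  /* 3 -3 */
        return 0;
    }

Guarding the 0.5 add with the exponent test matters: for values at or above 2^52 the add would round to nearest-even and could bump an odd integer by one.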
diff --git a/src/llroundf.c b/src/llroundf.c new file mode 100644 index 0000000..0e1ac8a --- /dev/null +++ b/src/llroundf.c
@@ -0,0 +1,132 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+/* In Windows, long long int is 64-bit and long int is 32-bit.
+   In Linux, long long int and long int are both 64-bit. */
+long long int FN_PROTOTYPE(llroundf)(float f)
+{
+    UT32 u32d;
+    UT32 u32Temp,u32result;
+    int intexp, shift;
+    U32 sign;
+    long long int result;
+
+    u32d.f32 = u32Temp.f32 = f;
+    if ((u32d.u32 & 0X7F800000) == 0x7F800000)
+    {
+        /* the number is nan or infinity; raise a range or domain error */
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = f;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, EDOM, "llroundf", f, is_x_snan, 0.0F, 0, (float)SIGNBIT_DP64, 0);
+            return SIGNBIT_DP64; /* GCC returns this when the number is out of range */
+        }
+    }
+
+    u32Temp.u32 &= 0x7FFFFFFF;
+    intexp = (u32d.u32 & 0x7F800000) >> 23;
+    sign = u32d.u32 & 0x80000000;
+    intexp -= 0x7F;
+
+
+    /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+    if (intexp < -1)
+        return (0);
+
+
+    /* 1.0 x 2^63 is already too large */
+    if (intexp >= 63)
+    {
+        result = 0x8000000000000000; /* Return LLONG_MIN */
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = f;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, EDOM, "llroundf", f, is_x_snan, 0.0F, 0, (float)result, 0);
+        }
+
+        return result;
+    }
+
+    u32result.f32 = u32Temp.f32;
+
+    /* >= 2^23 is already an exact integer */
+    if (intexp < 23)
+    {
+        /* add 0.5, extraction below will truncate */
+        u32result.f32 = u32Temp.f32 + 0.5F;
+    }
+    intexp = (u32result.u32 & 0x7f800000) >> 23;
+    intexp -= 0x7f;
+    u32result.u32 &= 0x7fffff;
+    u32result.u32 |= 0x00800000; /* restore the implicit leading mantissa bit */
+
+    result = u32result.u32;
+
+    /* The float mantissa is only 24 bits wide, so first shift the value into
+     * the upper half of the 64-bit result for headroom; the shift below then
+     * moves it back by those extra 32 bits, folded into intexp (55 = 23 + 32). */
+    result = result << 32;
+    shift = intexp - 55; /* 55 = 23 + 32 */
+
+
+    if(shift < 0)
+        result = result >> (-shift);
+    if(shift > 0)
+        result = result << (shift);
+
+    if (sign)
+        result = -result;
+    return result;
+
+}
+
+#else //WINDOWS
+/* llroundf is equivalent to the Linux implementation of lroundf;
+   long int and long long int are the same size there. */
+long long int FN_PROTOTYPE(llroundf)(float f)
+{
+    long long int result;
+    result = FN_PROTOTYPE(lroundf)(f);
+    return result;
+
+}
+#endif
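The 64-bit staging deserves a closer look: only 24 mantissa bits exist in a float, but up to 63 result bits may be needed, so parking the mantissa in the upper half (<< 32) lets a single signed shift distance against 55 = 23 + 32 cover both directions, mirroring the structure of the double routine. The arithmetic in isolation (values chosen for illustration):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t m24 = 0x00C00000;           /* 24-bit mantissa of 1.5 * 2^23 */
        int intexp = 30;                     /* example: value = 1.5 * 2^30 */
        long long r = (long long)m24 << 32;  /* park in the upper half */
        int shift = intexp - 55;             /* 55 = 23 + 32 */
        r = shift < 0 ? r >> -shift : r << shift;
        printf("%lld\n", r);                 /* 1610612736 = 1.5 * 2^30 */
        return 0;
    }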
diff --git a/src/log1p.c b/src/log1p.c new file mode 100644 index 0000000..b7cd097 --- /dev/null +++ b/src/log1p.c
@@ -0,0 +1,475 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_NAN_WITH_FLAGS +#define USE_VAL_WITH_FLAGS +#define USE_INFINITY_WITH_FLAGS +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_NAN_WITH_FLAGS +#undef USE_VAL_WITH_FLAGS +#undef USE_INFINITY_WITH_FLAGS +#undef USE_HANDLE_ERROR + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range result */ +static inline double retval_errno_erange_overflow(double x) +{ + struct exception exc; + exc.arg1 = x; + exc.arg2 = x; + exc.type = SING; + exc.name = (char *)"log1p"; + if (_LIB_VERSION == _SVID_) + exc.retval = -HUGE; + else + exc.retval = -infinity_with_flags(AMD_F_DIVBYZERO); + if (_LIB_VERSION == _POSIX_) + __set_errno(ERANGE); + else if (!matherr(&exc)) + __set_errno(ERANGE); + return exc.retval; +} + +/* Deal with errno for out-of-range argument */ +static inline double retval_errno_edom(double x) +{ + struct exception exc; + exc.arg1 = x; + exc.arg2 = x; + exc.type = DOMAIN; + exc.name = (char *)"log1p"; + if (_LIB_VERSION == _SVID_) + exc.retval = -HUGE; + else + exc.retval = nan_with_flags(AMD_F_INVALID); + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("log1p: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#undef _FUNCNAME +#define _FUNCNAME "log1p" + +double FN_PROTOTYPE(log1p)(double x) +{ + + int xexp; + double r, r1, r2, correction, f, f1, f2, q, u, v, z1, z2, poly, m2; + int index; + unsigned long long ux, ax; + + /* + Computes natural log(1+x). Algorithm based on: + Ping-Tak Peter Tang + "Table-driven implementation of the logarithm function in IEEE + floating-point arithmetic" + ACM Transactions on Mathematical Software (TOMS) + Volume 16, Issue 4 (December 1990) + Note that we use a lookup table of size 64 rather than 128, + and compensate by having extra terms in the minimax polynomial + for the kernel approximation. + */ + +/* Arrays ln_lead_table and ln_tail_table contain + leading and trailing parts respectively of precomputed + values of natural log(1+i/64), for i = 0, 1, ..., 64. + ln_lead_table contains the first 24 bits of precision, + and ln_tail_table contains a further 53 bits precision. 
*/ + + static const double ln_lead_table[65] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */ + 3.07716131210327148438e-02, /* 0x3f9f829800000000 */ + 4.58095073699951171875e-02, /* 0x3fa7745800000000 */ + 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */ + 7.52233862876892089844e-02, /* 0x3fb341d700000000 */ + 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */ + 1.03796780109405517578e-01, /* 0x3fba926d00000000 */ + 1.17783010005950927734e-01, /* 0x3fbe270700000000 */ + 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */ + 1.45181953907012939453e-01, /* 0x3fc2955280000000 */ + 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */ + 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */ + 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */ + 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */ + 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */ + 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */ + 2.35566020011901855469e-01, /* 0x3fce270700000000 */ + 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */ + 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */ + 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */ + 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */ + 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */ + 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */ + 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */ + 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */ + 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */ + 3.51976394653320312500e-01, /* 0x3fd686c800000000 */ + 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */ + 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */ + 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */ + 3.94993782043457031250e-01, /* 0x3fd9479400000000 */ + 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */ + 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */ + 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */ + 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */ + 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */ + 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */ + 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */ + 4.75845873355865478516e-01, /* 0x3fde744240000000 */ + 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */ + 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */ + 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */ + 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */ + 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */ + 5.32464742660522460938e-01, /* 0x3fe109f380000000 */ + 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */ + 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */ + 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */ + 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */ + 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */ + 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */ + 5.94707071781158447266e-01, /* 0x3fe307d720000000 */ + 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */ + 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */ + 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */ + 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */ + 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */ + 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */ + 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */ + 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */ + 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */ + 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */ + 6.85303986072540283203e-01, /* 
0x3fe5ee02a0000000 */ + 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */ + + static const double ln_tail_table[65] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */ + 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */ + 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */ + 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */ + 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */ + 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */ + 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */ + 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */ + 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */ + 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */ + 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */ + 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */ + 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */ + 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */ + 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */ + 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */ + 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */ + 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */ + 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */ + 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */ + 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */ + 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */ + 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */ + 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */ + 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */ + 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */ + 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */ + 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */ + 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */ + 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */ + 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */ + 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */ + 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */ + 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */ + 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */ + 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */ + 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */ + 4.43021445893361960146e-09, /* 0x3e33071282fb989b */ + 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */ + 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */ + 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */ + 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */ + 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */ + 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */ + 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */ + 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */ + 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */ + 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */ + 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */ + 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */ + 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */ + 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */ + 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */ + 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */ + 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */ + 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */ + 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */ + 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */ + 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */ + 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */ + 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */ + 
2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */ + 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */ + 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */ + + /* log2_lead and log2_tail sum to an extra-precise version + of log(2) */ + static const double + log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */ + log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */ + + static const double + /* Approximating polynomial coefficients for x near 0.0 */ + ca_1 = 8.33333333333317923934e-02, /* 0x3fb55555555554e6 */ + ca_2 = 1.25000000037717509602e-02, /* 0x3f89999999bac6d4 */ + ca_3 = 2.23213998791944806202e-03, /* 0x3f62492307f1519f */ + ca_4 = 4.34887777707614552256e-04, /* 0x3f3c8034c85dfff0 */ + + /* Approximating polynomial coefficients for other x */ + cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */ + cb_2 = 1.24999999978138668903e-02, /* 0x3f89999999865ede */ + cb_3 = 2.23219810758559851206e-03; /* 0x3f6249423bd94741 */ + + /* The values exp(-1/16)-1 and exp(1/16)-1 */ + static const double + log1p_thresh1 = -6.05869371865242201114e-02, /* 0xbfaf0540438fd5c4 */ + log1p_thresh2 = 6.44944589178594318568e-02; /* 0x3fb082b577d34ed8 */ + + + GET_BITS_DP64(x, ux); + ax = ux & ~SIGNBIT_DP64; + + if ((ux & EXPBITS_DP64) == EXPBITS_DP64) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_DP64) + { + /* x is NaN */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN, + 0, EDOM, x, 0.0); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is infinity */ + if (ux & SIGNBIT_DP64) + /* x is negative infinity. Return a NaN. */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x); +#endif + else + return x; + } + } + else if (ux >= 0xbff0000000000000) + { + /* x <= -1.0 */ + if (ux > 0xbff0000000000000) + { + /* x is less than -1.0. Return a NaN. */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return retval_errno_edom(x); +#endif + } + else + { + /* x is exactly -1.0. Return -infinity with div-by-zero flag. */ +#ifdef WINDOWS + return handle_error(_FUNCNAME, NINFBITPATT_DP64, _SING, + AMD_F_DIVBYZERO, ERANGE, x, 0.0); +#else + return retval_errno_erange_overflow(x); +#endif + } + } + else if (ax < 0x3ca0000000000000) + { + if (ax == 0x0000000000000000) + { + /* x is +/-zero. Return the same zero. */ + return x; + } + else + /* abs(x) is less than epsilon. Return x with inexact. */ + return val_with_flags(x, AMD_F_INEXACT); + } + + + if (x < log1p_thresh1 || x > log1p_thresh2) + { + /* x is outside the range [exp(-1/16)-1, exp(1/16)-1] */ + /* + First, we decompose the argument x to the form + 1 + x = 2**M * (F1 + F2), + where 1 <= F1+F2 < 2, M has the value of an integer, + F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128. + + Second, we approximate log( 1 + F2/F1 ) by an odd polynomial + in U, where U = 2 F2 / (2 F1 + F2). + Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ). + The core approximation calculates + Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1. + Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ), + thus, Poly = 2 arctanh( U/2 ) / U - 1. + + It is not hard to see that + log(x) = M*log(2) + log(F1) + log( 1 + F2/F1 ). + Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1). + The values of log(F1) are calculated beforehand and stored + in the program. 
+ */ + + f = 1.0 + x; + GET_BITS_DP64(f, ux); + + /* Store the exponent of x in xexp and put + f into the range [1.0,2.0) */ + xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64; + PUT_BITS_DP64((ux & MANTBITS_DP64) | ONEEXPBITS_DP64, f); + + /* Now (1+x) = 2**(xexp) * f, 1 <= f < 2. */ + + /* Set index to be the nearest integer to 64*f */ + /* 64 <= index <= 128 */ + /* + r = 64.0 * f; + index = (int)(r + 0.5); + */ + /* This code instead of the above can save several cycles. + It only works because 64 <= r < 128, so + the nearest integer is always contained in exactly + 7 bits, and the right shift is always the same. */ + index = (int)((((ux & 0x000fc00000000000) | 0x0010000000000000) >> 46) + + ((ux & 0x0000200000000000) >> 45)); + + f1 = index * 0.015625; /* 0.015625 = 1/64 */ + index -= 64; + + /* Now take great care to compute f2 such that f1 + f2 = f */ + if (xexp <= -2 || xexp >= MANTLENGTH_DP64 + 8) + { + f2 = f - f1; + } + else + { + /* Create the number m2 = 2.0^(-xexp) */ + ux = (unsigned long long)(0x3ff - xexp) << EXPSHIFTBITS_DP64; + PUT_BITS_DP64(ux,m2); + if (xexp <= MANTLENGTH_DP64 - 1) + { + f2 = (m2 - f1) + m2*x; + } + else + { + f2 = (m2*x - f1) + m2; + } + } + + /* At this point, x = 2**xexp * ( f1 + f2 ) where + f1 = j/64, j = 1, 2, ..., 64 and |f2| <= 1/128. */ + + z1 = ln_lead_table[index]; + q = ln_tail_table[index]; + + /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */ + u = f2 / (f1 + 0.5 * f2); + + /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1). + The core approximation calculates + poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */ + v = u * u; + poly = (v * (cb_1 + v * (cb_2 + v * cb_3))); + z2 = q + (u + u * poly); + + /* Now z1,z2 is an extra-precise approximation of log(f). */ + + /* Add xexp * log(2) to z1,z2 to get the result log(1+x). + The computed r1 is not subject to rounding error because + xexp has at most 10 significant bits, log(2) has 24 significant + bits, and z1 has up to 24 bits; and the exponents of z1 + and z2 differ by at most 6. */ + r1 = (xexp * log2_lead + z1); + r2 = (xexp * log2_tail + z2); + /* Natural log(1+x) */ + return r1 + r2; + } + else + { + /* Arguments close to 0.0 are handled separately to maintain + accuracy. + + The approximation in this region exploits the identity + log( 1 + r ) = log( 1 + u/2 ) - log( 1 - u/2 ), where + u = 2r / (2+r). + Note that the right hand side has an odd Taylor series expansion + which converges much faster than the Taylor series expansion of + log( 1 + r ) in r. Thus, we approximate log( 1 + r ) by + u + A1 * u^3 + A2 * u^5 + ... + An * u^(2n+1). + + One subtlety is that since u cannot be calculated from + r exactly, the rounding error in the first u should be + avoided if possible. To accomplish this, we observe that + u = r - r*r/(2+r). + Since x (=r) is the input argument, and thus presumed exact, + the formula above approximates u accurately because + u = r - correction, + and the magnitude of "correction" (of the order of r*r) + is small. + With these observations, we will approximate log( 1 + r ) by + r + ( (A1*u^3 + ... + An*u^(2n+1)) - correction ). + + We approximate log(1+r) by an odd polynomial in u, where + u = 2r/(2+r) = r - r*r/(2+r). + */ + r = x; + u = r / (2.0 + r); + correction = r * u; + u = u + u; + v = u * u; + r1 = r; + r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction); + return r1 + r2; + } +} + +weak_alias (__log1p, log1p)
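The split at the center of the algorithm is exact: f = f1 + f2 with f1 = index/64 the nearest 64th (so |f2| <= 1/128), and log(1 + f2/f1) is recovered from u = 2*f2/(2*f1 + f2) through the arctanh identity described in the comments. The identity can be checked numerically with nothing but the C library (a reference check only; the routine above reads log(f1) from the precomputed ln_lead_table/ln_tail_table rather than calling log):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double f = 1.37;                    /* any f in [1, 2) */
        int index = (int)(64.0 * f + 0.5);  /* nearest 64th: 64..128 */
        double f1 = index * 0.015625;       /* index / 64 */
        double f2 = f - f1;                 /* |f2| <= 1/128 */
        double u  = f2 / (f1 + 0.5 * f2);   /* u = 2*f2 / (2*f1 + f2) */
        /* log(f) = log(f1) + [log(1 + u/2) - log(1 - u/2)] */
        double lhs = log(f);
        double rhs = log(f1) + (log(1.0 + 0.5 * u) - log(1.0 - 0.5 * u));
        printf("%.17g\n%.17g\n", lhs, rhs); /* agree to within ~1 ulp */
        return 0;
    }

The identity holds because 1 + u/2 = (2*f1 + 2*f2)/(2*f1 + f2) and 1 - u/2 = 2*f1/(2*f1 + f2), whose quotient is exactly 1 + f2/f1.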
diff --git a/src/log1pf.c b/src/log1pf.c new file mode 100644 index 0000000..375a846 --- /dev/null +++ b/src/log1pf.c
@@ -0,0 +1,416 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_NANF_WITH_FLAGS +#define USE_VALF_WITH_FLAGS +#define USE_INFINITYF_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_NANF_WITH_FLAGS +#undef USE_VALF_WITH_FLAGS +#undef USE_INFINITYF_WITH_FLAGS +#undef USE_HANDLE_ERRORF + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range result */ +static inline float retval_errno_erange_overflow(float x) +{ + struct exception exc; + exc.arg1 = (double)x; + exc.arg2 = (double)x; + exc.type = SING; + exc.name = (char *)"log1pf"; + if (_LIB_VERSION == _SVID_) + exc.retval = -HUGE; + else + exc.retval = -infinityf_with_flags(AMD_F_DIVBYZERO); + if (_LIB_VERSION == _POSIX_) + __set_errno(ERANGE); + else if (!matherr(&exc)) + __set_errno(ERANGE); + return exc.retval; +} + +/* Deal with errno for out-of-range argument */ +static inline float retval_errno_edom(float x) +{ + struct exception exc; + exc.arg1 = (double)x; + exc.arg2 = (double)x; + exc.type = DOMAIN; + exc.name = (char *)"log1pf"; + if (_LIB_VERSION == _SVID_) + exc.retval = -HUGE; + else + exc.retval = nanf_with_flags(AMD_F_INVALID); + if (_LIB_VERSION == _POSIX_) + __set_errno(EDOM); + else if (!matherr(&exc)) + { + if(_LIB_VERSION == _SVID_) + (void)fputs("log1pf: DOMAIN error\n", stderr); + __set_errno(EDOM); + } + return exc.retval; +} +#endif + +#undef _FUNCNAME +#define _FUNCNAME "log1pf" + +float FN_PROTOTYPE(log1pf)(float x) +{ + + int xexp; + double dx, r, f, f1, f2, q, u, v, z1, z2, poly, m2; + int index; + unsigned int ux, ax; + unsigned long long lux; + + /* + Computes natural log(1+x) for float arguments. Algorithm is + basically a promotion of the arguments to double followed + by an inlined version of the double algorithm, simplified + for efficiency (see log1p_amd.c). Simplifications include: + * Special algorithm for arguments near 0.0 not required + * Scaling of denormalised arguments not required + * Shorter core series approximations used + Note that we use a lookup table of size 64 rather than 128, + and compensate by having extra terms in the minimax polynomial + for the kernel approximation. + */ + +/* Arrays ln_lead_table and ln_tail_table contain + leading and trailing parts respectively of precomputed + values of natural log(1+i/64), for i = 0, 1, ..., 64. + ln_lead_table contains the first 24 bits of precision, + and ln_tail_table contains a further 53 bits precision. 
*/ + + static const double ln_lead_table[65] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */ + 3.07716131210327148438e-02, /* 0x3f9f829800000000 */ + 4.58095073699951171875e-02, /* 0x3fa7745800000000 */ + 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */ + 7.52233862876892089844e-02, /* 0x3fb341d700000000 */ + 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */ + 1.03796780109405517578e-01, /* 0x3fba926d00000000 */ + 1.17783010005950927734e-01, /* 0x3fbe270700000000 */ + 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */ + 1.45181953907012939453e-01, /* 0x3fc2955280000000 */ + 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */ + 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */ + 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */ + 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */ + 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */ + 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */ + 2.35566020011901855469e-01, /* 0x3fce270700000000 */ + 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */ + 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */ + 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */ + 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */ + 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */ + 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */ + 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */ + 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */ + 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */ + 3.51976394653320312500e-01, /* 0x3fd686c800000000 */ + 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */ + 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */ + 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */ + 3.94993782043457031250e-01, /* 0x3fd9479400000000 */ + 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */ + 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */ + 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */ + 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */ + 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */ + 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */ + 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */ + 4.75845873355865478516e-01, /* 0x3fde744240000000 */ + 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */ + 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */ + 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */ + 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */ + 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */ + 5.32464742660522460938e-01, /* 0x3fe109f380000000 */ + 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */ + 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */ + 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */ + 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */ + 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */ + 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */ + 5.94707071781158447266e-01, /* 0x3fe307d720000000 */ + 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */ + 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */ + 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */ + 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */ + 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */ + 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */ + 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */ + 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */ + 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */ + 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */ + 6.85303986072540283203e-01, /* 
0x3fe5ee02a0000000 */ + 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */ + + static const double ln_tail_table[65] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */ + 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */ + 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */ + 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */ + 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */ + 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */ + 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */ + 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */ + 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */ + 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */ + 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */ + 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */ + 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */ + 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */ + 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */ + 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */ + 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */ + 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */ + 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */ + 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */ + 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */ + 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */ + 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */ + 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */ + 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */ + 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */ + 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */ + 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */ + 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */ + 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */ + 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */ + 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */ + 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */ + 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */ + 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */ + 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */ + 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */ + 4.43021445893361960146e-09, /* 0x3e33071282fb989b */ + 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */ + 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */ + 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */ + 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */ + 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */ + 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */ + 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */ + 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */ + 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */ + 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */ + 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */ + 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */ + 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */ + 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */ + 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */ + 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */ + 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */ + 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */ + 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */ + 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */ + 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */ + 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */ + 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */ + 
2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */ + 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */ + 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */ + + static const double + log2 = 6.931471805599453e-01, /* 0x3fe62e42fefa39ef */ + + /* Approximating polynomial coefficients */ + cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */ + cb_2 = 1.24999999978138668903e-02; /* 0x3f89999999865ede */ + + GET_BITS_SP32(x, ux); + ax = ux & ~SIGNBIT_SP32; + + if ((ux & EXPBITS_SP32) == EXPBITS_SP32) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_SP32) + { + /* x is NaN */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN, + 0, EDOM, x, 0.0F); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is infinity */ + if (ux & SIGNBIT_SP32) + { + /* x is negative infinity. Return a NaN. */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x); +#endif + } + else + return x; + } + } + else if (ux >= 0xbf800000) + { + /* x <= -1.0 */ + if (ux > 0xbf800000) + { + /* x is less than -1.0. Return a NaN. */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return retval_errno_edom(x); +#endif + } + else + { + /* x is exactly -1.0. Return -infinity with div-by-zero flag. */ +#ifdef WINDOWS + return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _SING, + AMD_F_DIVBYZERO, ERANGE, x, 0.0F); +#else + return retval_errno_erange_overflow(x); +#endif + } + } + else if (ax < 0x33800000) + { + if (ax == 0x00000000) + { + /* x is +/-zero. Return the same zero. */ + return x; + } + else + /* abs(x) is less than float epsilon. Return x with inexact. */ + return valf_with_flags(x, AMD_F_INEXACT); + } + + dx = x; + /* + First, we decompose the argument dx to the form + 1 + dx = 2**M * (F1 + F2), + where 1 <= F1+F2 < 2, M has the value of an integer, + F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128. + + Second, we approximate log( 1 + F2/F1 ) by an odd polynomial + in U, where U = 2 F2 / (2 F2 + F1). + Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ). + The core approximation calculates + Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1. + Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ), + thus, Poly = 2 arctanh( U/2 ) / U - 1. + + It is not hard to see that + log(dx) = M*log(2) + log(F1) + log( 1 + F2/F1 ). + Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1). + The values of log(F1) are calculated beforehand and stored + in the program. + */ + + f = 1.0 + dx; + GET_BITS_DP64(f, lux); + + /* Store the exponent of f = 1 + dx in xexp and put + f into the range [1.0,2.0) */ + xexp = (int)((lux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64; + PUT_BITS_DP64((lux & MANTBITS_DP64) | ONEEXPBITS_DP64, f); + + /* Now (1+dx) = 2**(xexp) * f, 1 <= f < 2. */ + + /* Set index to be the nearest integer to 64*f */ + /* 64 <= index <= 128 */ + /* + r = 64.0 * f; + index = (int)(r + 0.5); + */ + /* This code instead of the above can save several cycles. + It only works because 64 <= r < 128, so + the nearest integer is always contained in exactly + 7 bits, and the right shift is always the same. 
*/ + index = (int)((((lux & 0x000fc00000000000) | 0x0010000000000000) >> 46) + + ((lux & 0x0000200000000000) >> 45)); + + f1 = index * 0.015625; /* 0.015625 = 1/64 */ + index -= 64; + + /* Now take great care to compute f2 such that f1 + f2 = f */ + if (xexp <= -2 || xexp >= MANTLENGTH_DP64 + 8) + { + f2 = f - f1; + } + else + { + /* Create the number m2 = 2.0^(-xexp) */ + lux = (unsigned long long)(0x3ff - xexp) << EXPSHIFTBITS_DP64; + PUT_BITS_DP64(lux,m2); + if (xexp <= MANTLENGTH_DP64 - 1) + { + f2 = (m2 - f1) + m2*dx; + } + else + { + f2 = (m2*dx - f1) + m2; + } + } + + /* At this point, dx = 2**xexp * ( f1 + f2 ) where + f1 = j/64, j = 1, 2, ..., 64 and |f2| <= 1/128. */ + + z1 = ln_lead_table[index]; + q = ln_tail_table[index]; + + /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */ + u = f2 / (f1 + 0.5 * f2); + + /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1). + The core approximation calculates + poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */ + v = u * u; + poly = (v * (cb_1 + v * cb_2)); + z2 = q + (u + u * poly); + + /* Now z1,z2 is an extra-precise approximation of log(f). */ + + /* Add xexp * log(2) to z1,z2 to get the result log(1+x). */ + r = xexp * log2 + z1 + z2; + /* Natural log(1+x) */ + return (float)r; +} + +weak_alias (__log1pf, log1pf)
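Since every intermediate in log1pf is a double, a single accumulation xexp*log(2) + z1 + z2 already carries far more precision than the 24 bits the float result needs, which is why the extra-precision bookkeeping and the separate near-zero path of the double routine are dropped. Behaviorally the routine is close to the promote-compute-demote model below (a reference sketch, not the tabled implementation; link with -lm):

    #include <math.h>
    #include <stdio.h>

    static float log1pf_model(float x)
    {
        return (float)log1p((double)x);  /* one rounding, at the end */
    }

    int main(void)
    {
        printf("%.9g %.9g\n", log1pf_model(0.4f), log1pf(0.4f));
        return 0;
    }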
diff --git a/src/log_special.c b/src/log_special.c new file mode 100644 index 0000000..53a92b8 --- /dev/null +++ b/src/log_special.c
@@ -0,0 +1,141 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + +#ifdef __x86_64__ + +#include <emmintrin.h> +#include <math.h> +#include <errno.h> + +#include "../inc/libm_util_amd.h" +#include "../inc/libm_special.h" + +// y = log10f(x) +// y = log10(x) +// y = logf(x) +// y = log(x) + +// these codes and the ones in the related .S or .asm files have to match +#define LOG_X_ZERO 1 +#define LOG_X_NEG 2 +#define LOG_X_NAN 3 + +static float _logf_special_common(float x, float y, U32 code, const char *name) +{ + switch(code) + { + case LOG_X_ZERO: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO); + __amd_handle_errorf(SING, ERANGE, name, x, 0, 0.0f, 0, y, 0); + } + break; + + case LOG_X_NEG: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); + __amd_handle_errorf(DOMAIN, EDOM, name, x, 0, 0.0f, 0, y, 0); + } + break; + + case LOG_X_NAN: + { +#ifdef WIN64 + // y is assumed to be qnan, only check x for snan + unsigned int is_x_snan; + UT32 xm; xm.f32 = x; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(DOMAIN, EDOM, name, x, is_x_snan, 0.0f, 0, y, 0); +#else + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#endif + } + break; + } + + return y; +} + +float _logf_special(float x, float y, U32 code) +{ + return _logf_special_common(x, y, code, "logf"); +} + +float _log10f_special(float x, float y, U32 code) +{ + return _logf_special_common(x, y, code, "log10f"); +} + +float _log2f_special(float x, float y, U32 code) +{ + return _logf_special_common(x, y, code, "log2f"); +} + +static double _log_special_common(double x, double y, U32 code, + const char *name) +{ + switch(code) + { + case LOG_X_ZERO: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO); + __amd_handle_error(SING, ERANGE, name, x, 0.0, y); + } + break; + + case LOG_X_NEG: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); + __amd_handle_error(DOMAIN, EDOM, name, x, 0.0, y); + } + break; + + case LOG_X_NAN: + { +#ifdef WIN64 + __amd_handle_error(DOMAIN, EDOM, name, x, 0.0, y); +#else + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#endif + } + break; + } + + return y; +} + +double _log_special(double x, double y, U32 code) +{ + return _log_special_common(x, y, code, "log"); +} + +double _log10_special(double x, double y, U32 code) +{ + return _log_special_common(x, y, code, "log10"); +} + +double _log2_special(double x, double y, U32 code) +{ + return _log_special_common(x, y, code, "log2"); +} + +#endif /* __x86_64__ */
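The _mm_setcsr(_mm_getcsr() | ...) pattern ORs a sticky exception bit into MXCSR so the assembly fast paths that dispatch here still publish IEEE status flags the caller can observe. A minimal illustration for the divide-by-zero bit (0x4 is the architectural MXCSR sticky bit for zero-divide; the MXCSR_ES_* names above come from patch headers not shown here, and on x86-64 glibc fetestexcept reads MXCSR as well as the x87 status word):

    #include <fenv.h>
    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void)
    {
        feclearexcept(FE_ALL_EXCEPT);
        _mm_setcsr(_mm_getcsr() | 0x4);    /* raise the sticky zero-divide flag */
        printf("FE_DIVBYZERO is %s\n",
               fetestexcept(FE_DIVBYZERO) ? "set" : "clear");
        return 0;
    }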
diff --git a/src/logb.c b/src/logb.c new file mode 100644 index 0000000..7c75ef1 --- /dev/null +++ b/src/logb.c
@@ -0,0 +1,102 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_INFINITY_WITH_FLAGS +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_INFINITY_WITH_FLAGS +#undef USE_HANDLE_ERROR + +#ifdef WINDOWS +#include "../inc/libm_errno_amd.h" +#endif + +#ifdef WINDOWS +double FN_PROTOTYPE(logb)(double x) +#else +double FN_PROTOTYPE(logb)(double x) +#endif +{ + + unsigned long long ux; + long long u; + GET_BITS_DP64(x, ux); + u = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64; + if ((ux & ~SIGNBIT_DP64) == 0) + /* x is +/-zero. Return -infinity with div-by-zero flag. */ +#ifdef WINDOWS + return handle_error("logb", NINFBITPATT_DP64, _SING, + AMD_F_DIVBYZERO, ERANGE, x, 0.0); +#else + return -infinity_with_flags(AMD_F_DIVBYZERO); +#endif + else if (EMIN_DP64 <= u && u <= EMAX_DP64) + /* x is a normal number */ + return (double)u; + else if (u > EMAX_DP64) + { + /* x is infinity or NaN */ + if ((ux & MANTBITS_DP64) == 0) +#ifdef WINDOWS + /* x is +/-infinity. For VC++, return infinity of same sign. */ + return x; +#else + /* x is +/-infinity. Return +infinity with no flags. */ + return infinity_with_flags(0); +#endif + else + /* x is NaN, result is NaN */ +#ifdef WINDOWS + return handle_error("logb", ux|0x0008000000000000, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is denormalized. */ +#ifdef FOLLOW_IEEE754_LOGB + /* Return the value of the minimum exponent to ensure that + the relationship between logb and scalb, defined in + IEEE 754, holds. */ + return EMIN_DP64; +#else + /* Follow the rule set by IEEE 854 for logb */ + ux &= MANTBITS_DP64; + u = EMIN_DP64; + while (ux < IMPBIT_DP64) + { + ux <<= 1; + u--; + } + return (double)u; +#endif + } + +} + +weak_alias (__logb, logb)
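For denormals the loop above implements the IEEE 854 definition of logb: start from EMIN (-1022 for binary64) and subtract one per leading zero of the stored mantissa, which matches what C99 ilogb reports for the same input. An equivalent standalone check (function name illustrative; 1ull << 52 stands in for the IMPBIT_DP64 constant from the headers):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static int logb_denormal(double x)    /* assumes 0 < |x| < 2^-1022 */
    {
        uint64_t m;
        memcpy(&m, &x, sizeof m);
        m &= 0x000fffffffffffffull;       /* exponent field is zero */
        int e = -1022;
        while (m < (1ull << 52)) { m <<= 1; e--; }  /* walk to the implicit-bit slot */
        return e;
    }

    int main(void)
    {
        printf("%d %d\n", logb_denormal(0x1p-1060), ilogb(0x1p-1060));  /* -1060 -1060 */
        return 0;
    }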
diff --git a/src/logbf.c b/src/logbf.c new file mode 100644 index 0000000..d64e531 --- /dev/null +++ b/src/logbf.c
@@ -0,0 +1,100 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_INFINITYF_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_INFINITYF_WITH_FLAGS +#undef USE_HANDLE_ERRORF + +#ifdef WINDOWS +#include "../inc/libm_errno_amd.h" +#endif + +#ifdef WINDOWS +float FN_PROTOTYPE(logbf)(float x) +#else +float FN_PROTOTYPE(logbf)(float x) +#endif +{ + unsigned int ux; + int u; + GET_BITS_SP32(x, ux); + u = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32; + if ((ux & ~SIGNBIT_SP32) == 0) + /* x is +/-zero. Return -infinity with div-by-zero flag. */ +#ifdef WINDOWS + return handle_errorf("logbf", NINFBITPATT_SP32, _SING, + AMD_F_DIVBYZERO, ERANGE, x, 0.0F); +#else + return -infinityf_with_flags(AMD_F_DIVBYZERO); +#endif + else if (EMIN_SP32 <= u && u <= EMAX_SP32) + /* x is a normal number */ + return (float)u; + else if (u > EMAX_SP32) + { + /* x is infinity or NaN */ + if ((ux & MANTBITS_SP32) == 0) +#ifdef WINDOWS + /* x is +/-infinity. For VC++, return infinity of same sign. */ + return x; +#else + /* x is +/-infinity. Return +infinity with no flags. */ + return infinityf_with_flags(0); +#endif + else + /* x is NaN, result is NaN */ +#ifdef WINDOWS + return handle_errorf("logbf", ux|0x00400000, _DOMAIN, + AMD_F_INVALID, EDOM, x, 0.0F); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is denormalized. */ +#ifdef FOLLOW_IEEE754_LOGB + /* Return the value of the minimum exponent to ensure that + the relationship between logb and scalb, defined in + IEEE 754, holds. */ + return EMIN_SP32; +#else + /* Follow the rule set by IEEE 854 for logb */ + ux &= MANTBITS_SP32; + u = EMIN_SP32; + while (ux < IMPBIT_SP32) + { + ux <<= 1; + u--; + } + return (float)u; +#endif + } +} + +weak_alias (__logbf, logbf)
diff --git a/src/lrint.c b/src/lrint.c new file mode 100644 index 0000000..e3c0e41 --- /dev/null +++ b/src/lrint.c
@@ -0,0 +1,62 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long int FN_PROTOTYPE(lrint)(double x)
+{
+
+    UT64 checkbits,val_2p52;
+    checkbits.f64=x;
+
+    /* Mask off the sign bit and check if the value can be rounded */
+
+    if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+    {
+        /* number can't be rounded; raise an exception */
+        /* Number exceeds the representable range; it could also be nan or inf */
+        __amd_handle_error(DOMAIN, EDOM, "lrint", x, 0.0, (double)x);
+
+        return (long int) x;
+    }
+
+    val_2p52.u32[1] = (checkbits.u32[1] & 0x80000000) | 0x43300000;
+    val_2p52.u32[0] = 0;
+
+    /* Add and subtract 2^52 to round the number according to the current rounding direction */
+
+    return (long int) ((x + val_2p52.f64) - val_2p52.f64);
+}
diff --git a/src/lrintf.c b/src/lrintf.c new file mode 100644 index 0000000..abcd37b --- /dev/null +++ b/src/lrintf.c
@@ -0,0 +1,67 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long int FN_PROTOTYPE(lrintf)(float x)
+{
+
+    UT32 checkbits,val_2p23;
+    checkbits.f32=x;
+
+    /* Mask off the sign bit and check if the value can be rounded */
+
+    if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+    {
+        /* number can't be rounded; raise an exception */
+        /* Number exceeds the representable range; it could also be nan or inf */
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, EDOM, "lrintf", x, is_x_snan, 0.0F, 0, (float)x, 0);
+        }
+
+        return (long int) x;
+    }
+
+
+    val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+    /* Add and subtract 2^23 to round the number according to the current rounding direction */
+
+    return (long int) ((x + val_2p23.f32) - val_2p23.f32);
+}
diff --git a/src/lround.c b/src/lround.c new file mode 100644 index 0000000..dfe411d --- /dev/null +++ b/src/lround.c
@@ -0,0 +1,135 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifdef WIN64 +#include <fpieee.h> +#else +#include <errno.h> +#endif + +#include <math.h> +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#include "../inc/libm_special.h" + +long int FN_PROTOTYPE(lround)(double d) +{ + UT64 u64d; + UT64 u64Temp,u64result; + int intexp, shift; + U64 sign; + long int result; + + u64d.f64 = u64Temp.f64 = d; + + if ((u64d.u32[1] & 0X7FF00000) == 0x7FF00000) + { + /*else the number is infinity*/ + //Raise range or domain error + #ifdef WIN64 + __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double)SIGNBIT_SP32); + return (long int )SIGNBIT_SP32; + #else + __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double)SIGNBIT_DP64); + return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/ + #endif + + } + + u64Temp.u32[1] &= 0x7FFFFFFF; + intexp = (u64d.u32[1] & 0x7FF00000) >> 20; + sign = u64d.u64 & 0x8000000000000000; + intexp -= 0x3FF; + + /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */ + if (intexp < -1) + return (0); + +#ifdef WIN64 + /* 1.0 x 2^31 (or 2^63) is already too large */ + if (intexp >= 31) + { + /*Based on the sign of the input value return the MAX and MIN*/ + result = 0x80000000; /*Return LONG MIN*/ + + __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double) result); + + return result; + } + + +#else + /* 1.0 x 2^31 (or 2^63) is already too large */ + if (intexp >= 63) + { + /*Based on the sign of the input value return the MAX and MIN*/ + result = 0x8000000000000000; /*Return LONG MIN*/ + + __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double) result); + + return result; + } + +#endif + + u64result.f64 = u64Temp.f64; + /* >= 2^52 is already an exact integer */ +#ifdef WIN64 + if (intexp < 23) +#else + if (intexp < 52) +#endif + { + /* add 0.5, extraction below will truncate */ + u64result.f64 = u64Temp.f64 + 0.5; + } + + intexp = ((u64result.u32[1] >> 20) & 0x7ff) - 0x3FF; + + u64result.u32[1] &= 0xfffff; + u64result.u32[1] |= 0x00100000; /*Mask the last exp bit to 1*/ + shift = intexp - 52; + +#ifdef WIN64 + /*The shift value will always be negative.*/ + u64result.u64 = u64result.u64 >> (-shift); + /*Result will be stored in the lower word due to the shift being performed*/ + result = u64result.u32[0]; +#else + if(shift < 0) + u64result.u64 = u64result.u64 >> (-shift); + if(shift > 0) + u64result.u64 = u64result.u64 << (shift); + + result = u64result.u64; +#endif + + + + if (sign) + result = -result; + + return result; +} +
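Unlike lrint, lround is specified to round halfway cases away from zero regardless of the current rounding mode, which is why the code above adds 0.5 and truncates rather than reusing the 2^52 trick. A quick behavioural comparison (illustrative only):

#include <stdio.h>
#include <math.h>

int main(void)
{
    printf("%ld %ld\n", lround(2.5), lround(-2.5)); /* 3 -3: halfway cases go away from zero */
    printf("%ld\n", lrint(2.5));                    /* 2 under default round-to-nearest-even */
    return 0;
}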
diff --git a/src/lroundf.c b/src/lroundf.c new file mode 100644 index 0000000..799e960 --- /dev/null +++ b/src/lroundf.c
@@ -0,0 +1,147 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#ifdef WIN64 +#include <fpieee.h> +#else +#include <errno.h> +#endif + +#include <math.h> +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#include "../inc/libm_special.h" + +long int FN_PROTOTYPE(lroundf)(float f) +{ + UT32 u32d; + UT32 u32Temp,u32result; + int intexp, shift; + U32 sign; + long int result; + + u32d.f32 = u32Temp.f32 = f; + if ((u32d.u32 & 0X7F800000) == 0x7F800000) + { + /*else the number is infinity*/ + //Raise range or domain error + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = f; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + #ifdef WIN64 + __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)SIGNBIT_SP32, 0); + return (long int)SIGNBIT_SP32; + #else + __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)SIGNBIT_DP64, 0); + return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/ + #endif + } + + } + + u32Temp.u32 &= 0x7FFFFFFF; + intexp = (u32d.u32 & 0x7F800000) >> 23; + sign = u32d.u32 & 0x80000000; + intexp -= 0x7F; + + + /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */ + if (intexp < -1) + return (0); + + +#ifdef WIN64 + /* 1.0 x 2^31 is already too large */ + if (intexp >= 31) + { + result = 0x80000000; + + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = f; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)result, 0); + } + + return result; + } + +#else + /* 1.0 x 2^31 (or 2^63) is already too large */ + if (intexp >= 63) + { + result = 0x8000000000000000; + + { + unsigned int is_x_snan; + UT32 xm; xm.f32 = f; + is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)result, 0); + } + + return result; + } + #endif + + u32result.f32 = u32Temp.f32; + + /* >= 2^23 is already an exact integer */ + if (intexp < 23) + { + /* add 0.5, extraction below will truncate */ + u32result.f32 = u32Temp.f32 + 0.5F; + } + intexp = (u32result.u32 & 0x7f800000) >> 23; + intexp -= 0x7f; + u32result.u32 &= 0x7fffff; + u32result.u32 |= 0x00800000; + + result = u32result.u32; + + #ifdef WIN64 + shift = intexp - 23; + #else + + /*Since float is only 32 bit for higher accuracy we shift the result by 32 bits + * In the next step we shift an extra 32 bits in the reverse direction based + * on the value of intexp*/ + result = result << 32; + shift = intexp - 55; /*55= 23 +32*/ + #endif + + + if(shift < 0) + result = result >> (-shift); + if(shift > 0) + result = result << (shift); + + if (sign) + result = -result; + return result; + +} + + +
diff --git a/src/modf.c b/src/modf.c new file mode 100644 index 0000000..836db46 --- /dev/null +++ b/src/modf.c
@@ -0,0 +1,80 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +double FN_PROTOTYPE(modf)(double x, double *iptr) +{ + /* modf splits the argument x into integer and fraction parts, + each with the same sign as x. */ + + + long long xexp; + unsigned long long ux, ax, mask; + + GET_BITS_DP64(x, ux); + ax = ux & (~SIGNBIT_DP64); + + if (ax >= 0x4340000000000000) + { + /* abs(x) is either NaN, infinity, or >= 2^53 */ + if (ax > 0x7ff0000000000000) + { + /* x is NaN */ + *iptr = x; + return x + x; /* Raise invalid if it is a signalling NaN */ + } + else + { + /* x is infinity or large. Return zero with the sign of x */ + *iptr = x; + PUT_BITS_DP64(ux & SIGNBIT_DP64, x); + return x; + } + } + else if (ax < 0x3ff0000000000000) + { + /* abs(x) < 1.0. Set iptr to zero with the sign of x + and return x. */ + PUT_BITS_DP64(ux & SIGNBIT_DP64, *iptr); + return x; + } + else + { + double r; + unsigned long long ur; + xexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64; + /* Mask out the bits of x that we don't want */ + mask = 1; + mask = (mask << (EXPSHIFTBITS_DP64 - xexp)) - 1; + PUT_BITS_DP64(ux & ~mask, *iptr); + r = x - *iptr; + GET_BITS_DP64(r, ur); + PUT_BITS_DP64(((ux & SIGNBIT_DP64)|ur), r); + return r; + } + +} + +weak_alias (__modf, modf)
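The masking in the final branch clears exactly the fraction bits of the significand, so the integral part is obtained without any arithmetic rounding; the two parts carry the sign of x and sum back to x exactly. For example (illustrative only):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double ipart;
    double fpart = modf(-3.75, &ipart);
    printf("%g + %g = %g\n", ipart, fpart, ipart + fpart); /* -3 + -0.75 = -3.75 */
    return 0;
}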
diff --git a/src/modff.c b/src/modff.c new file mode 100644 index 0000000..7e5eae7 --- /dev/null +++ b/src/modff.c
@@ -0,0 +1,74 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +float FN_PROTOTYPE(modff)(float x, float *iptr) +{ + /* modff splits the argument x into integer and fraction parts, + each with the same sign as x. */ + + unsigned int ux, mask; + int xexp; + + GET_BITS_SP32(x, ux); + xexp = ((ux & (~SIGNBIT_SP32)) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32; + + if (xexp < 0) + { + /* abs(x) < 1.0. Set iptr to zero with the sign of x + and return x. */ + PUT_BITS_SP32(ux & SIGNBIT_SP32, *iptr); + return x; + } + else if (xexp < EXPSHIFTBITS_SP32) + { + float r; + unsigned int ur; + /* x lies between 1.0 and 2**(24) */ + /* Mask out the bits of x that we don't want */ + mask = (1 << (EXPSHIFTBITS_SP32 - xexp)) - 1; + PUT_BITS_SP32(ux & ~mask, *iptr); + r = x - *iptr; + GET_BITS_SP32(r, ur); + PUT_BITS_SP32(((ux & SIGNBIT_SP32)|ur), r); + return r; + } + else if ((ux & (~SIGNBIT_SP32)) > 0x7f800000) + { + /* x is NaN */ + *iptr = x; + return x + x; /* Raise invalid if it is a signalling NaN */ + } + else + { + /* x is infinity or large. Set iptr to x and return zero + with the sign of x. */ + *iptr = x; + PUT_BITS_SP32(ux & SIGNBIT_SP32, x); + return x; + } +} + +weak_alias (__modff, modff)
diff --git a/src/nan.c b/src/nan.c new file mode 100644 index 0000000..fbfc52c --- /dev/null +++ b/src/nan.c
@@ -0,0 +1,114 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" +#include <stdio.h> + +double FN_PROTOTYPE(nan)(const char *tagp) +{ + + + /* Check for input range */ + UT64 checkbits; + U64 val=0; + S64 num; + checkbits.u64 =QNANBITPATT_DP64; + if(tagp == NULL) + { + return checkbits.f64; + } + + switch(*tagp) + { + case '0': /* base 8 */ + tagp++; + if( *tagp == 'x' || *tagp == 'X') + { + /* base 16 */ + tagp++; + while(*tagp != '\0') + { + + if(*tagp >= 'A' && *tagp <= 'F' ) + { + num = *tagp - 'A' + 10; + } + else + if(*tagp >= 'a' && *tagp <= 'f' ) + { + num = *tagp - 'a' + 10; + } + else + { + num = *tagp - '0'; + } + + if( (num < 0 || num > 15)) + { + val = QNANBITPATT_DP64; + break; + } + val = (val << 4) | num; + tagp++; + } + } + else + { + /* base 8 */ + while(*tagp != '\0') + { + num = *tagp - '0'; + if( num < 0 || num > 7) + { + val = QNANBITPATT_DP64; + break; + } + val = (val << 3) | num; + tagp++; + } + } + break; + default: + while(*tagp != '\0') + { + val = val*10; + num = *tagp - '0'; + if( num < 0 || num > 9) + { + val = QNANBITPATT_DP64; + break; + } + val = val + num; + tagp++; + } + + } + + if((val & ~NINFBITPATT_DP64) == 0) + val = QNANBITPATT_DP64; + + checkbits.u64 = (val | QNANBITPATT_DP64) & ~SIGNBIT_DP64; + return checkbits.f64 ; +} +
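The tag string is parsed as decimal, as octal after a leading 0, or as hex after a leading 0x, and the value is folded into the quiet-NaN significand. A round-trip check of that payload behaviour, assuming IEEE-754 binary64 doubles (illustrative only):

#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void)
{
    double d = nan("0x123");
    unsigned long long bits;
    memcpy(&bits, &d, sizeof bits);
    /* low 52 bits: the quiet bit ORed with the tag, 0x8000000000123 here */
    printf("significand = 0x%llx\n", bits & 0x000fffffffffffffULL);
    return 0;
}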
diff --git a/src/nanf.c b/src/nanf.c new file mode 100644 index 0000000..8d712f2 --- /dev/null +++ b/src/nanf.c
@@ -0,0 +1,120 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" +#include <stdio.h> + + +float FN_PROTOTYPE(nanf)(const char *tagp) +{ + + + /* Check for input range */ + UT32 checkbits; + U32 val=0; + S32 num; + checkbits.u32 =QNANBITPATT_SP32; + if(tagp == NULL) + return checkbits.f32 ; + + + switch(*tagp) + { + case '0': /* base 8 */ + tagp++; + if( *tagp == 'x' || *tagp == 'X') + { + /* base 16 */ + tagp++; + while(*tagp != '\0') + { + + if(*tagp >= 'A' && *tagp <= 'F' ) + { + num = *tagp - 'A' + 10; + } + else + if(*tagp >= 'a' && *tagp <= 'f' ) + { + num = *tagp - 'a' + 10; + } + else + { + num = *tagp - '0'; + } + + if( (num < 0 || num > 15)) + { + val = QNANBITPATT_SP32; + break; + } + val = (val << 4) | num; + tagp++; + } + } + else + { + /* base 8 */ + while(*tagp != '\0') + { + num = *tagp - '0'; + if( num < 0 || num > 7) + { + val = QNANBITPATT_SP32; + break; + } + val = (val << 3) | num; + tagp++; + } + } + break; + default: + while(*tagp != '\0') + { + val = val*10; + num = *tagp - '0'; + if( num < 0 || num > 9) + { + val = QNANBITPATT_SP32; + break; + } + val = val + num; + tagp++; + } + + } + +/* if(val > ~INDEFBITPATT_SP32) + val = (val | QNANBITPATT_SP32) & ~SIGNBIT_SP32; + + checkbits.u32 = val | EXPBITS_SP32 ; */ + + if((val & ~INDEFBITPATT_SP32) == 0) + val = QNANBITPATT_SP32; + + checkbits.u32 = (val | QNANBITPATT_SP32) & ~SIGNBIT_SP32; + + + return checkbits.f32 ; +}
diff --git a/src/nearbyintf.c b/src/nearbyintf.c new file mode 100644 index 0000000..2b656ef --- /dev/null +++ b/src/nearbyintf.c
@@ -0,0 +1,51 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+float FN_PROTOTYPE(nearbyintf)(float x)
+{
+    /* Check for input range */
+    UT32 checkbits,sign,val_2p23;
+    checkbits.f32=x;
+
+    /* Clear the sign bit and check if the value can be rounded (i.e. check if the exponent is less than 23) */
+    if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+    {
+        /* take care of NaN or Inf */
+        if((checkbits.u32 & 0x7f800000)== 0x7f800000)
+            return x+x;
+        else
+            return x;
+    }
+
+    sign.u32 = checkbits.u32 & 0x80000000;
+    val_2p23.u32 = sign.u32 | 0x4B000000;
+    val_2p23.f32 = (x + val_2p23.f32) - val_2p23.f32;
+    /* This extra step takes care of denormals and the various rounding modes */
+    val_2p23.u32 = ((val_2p23.u32 << 1) >> 1) | sign.u32;
+    return (val_2p23.f32);
+}
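nearbyintf rounds in the current direction and, unlike rintf later in this patch, is not supposed to raise the inexact exception. A rounding-mode probe (illustrative only):

#include <stdio.h>
#include <math.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    fesetround(FE_DOWNWARD);
    printf("%g\n", nearbyintf(2.5f)); /* 2 */

    fesetround(FE_UPWARD);
    printf("%g\n", nearbyintf(2.5f)); /* 3 */
    return 0;
}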
diff --git a/src/nextafter.c b/src/nextafter.c new file mode 100644 index 0000000..62d9b5a --- /dev/null +++ b/src/nextafter.c
@@ -0,0 +1,91 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#endif
+
+#include <float.h>
+#include <math.h>
+#include <errno.h>
+#include <limits.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+double FN_PROTOTYPE(nextafter)(double x, double y)
+{
+
+
+    UT64 checkbits;
+    double dy = y;
+    checkbits.f64=x;
+
+    /* if x == y return y in the type of x */
+    if( x == dy )
+    {
+        return dy;
+    }
+
+    /* check if the number is NaN */
+    if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+    {
+        __amd_handle_error(DOMAIN, ERANGE, "nextafter", x, y, x+x);
+
+        return x+x;
+    }
+
+    if( x == 0.0)
+    {
+        checkbits.u64 = 1;
+        if( dy > 0.0 )
+            return checkbits.f64;
+        else
+            return -checkbits.f64;
+    }
+
+
+    /* compute the next higher or lower value */
+
+    if(((x>0.0) ^ (dy>x)) == 0)
+    {
+        checkbits.u64++;
+    }
+    else
+    {
+        checkbits.u64--;
+    }
+
+    /* check if the result is NaN or Inf */
+    if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+    {
+        __amd_handle_error(DOMAIN, ERANGE, "nextafter", x, y, checkbits.f64);
+
+    }
+
+    return checkbits.f64;
+}
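The increment/decrement of the bit pattern works because, for finite doubles of one sign, consecutive bit patterns are consecutive representable values, so stepping the integer by one is exactly one ulp. Illustrative check (not part of the patch):

#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void)
{
    double x = 1.0;
    double up = nextafter(x, 2.0);
    unsigned long long bx, bup;

    memcpy(&bx, &x, sizeof bx);
    memcpy(&bup, &up, sizeof bup);
    printf("bit step = %llu\n", bup - bx); /* 1 */
    printf("up - x   = %g\n", up - x);     /* 2^-52, one ulp of 1.0 */
    return 0;
}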
diff --git a/src/nextafterf.c b/src/nextafterf.c new file mode 100644 index 0000000..019187f --- /dev/null +++ b/src/nextafterf.c
@@ -0,0 +1,102 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#endif
+
+
+#include <float.h>
+#include <math.h>
+#include <errno.h>
+#include <limits.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+float FN_PROTOTYPE(nextafterf)(float x, float y)
+{
+
+
+    UT32 checkbits;
+    float dy = y;
+    checkbits.f32=x;
+
+    /* if x == y return y in the type of x */
+    if( x == dy )
+    {
+        return dy;
+    }
+
+    /* check if the number is NaN */
+    if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+    {
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, ERANGE, "nextafterf", x, is_x_snan, y, 0, x+x, 0);
+
+        }
+
+        return x+x;
+    }
+
+    if( x == 0.0)
+    {
+        checkbits.u32 = 1;
+        if( dy > 0.0 )
+            return checkbits.f32;
+        else
+            return -checkbits.f32;
+    }
+
+
+    /* compute the next higher or lower value */
+    if(((x>0.0F) ^ (dy>x)) == 0)
+    {
+        checkbits.u32++;
+    }
+    else
+    {
+        checkbits.u32--;
+    }
+
+    /* check if the result is NaN or Inf */
+    if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+    {
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, ERANGE, "nextafterf", x, is_x_snan, y, 0, checkbits.f32, 0);
+
+        }
+    }
+
+    return checkbits.f32;
+}
diff --git a/src/nexttoward.c b/src/nexttoward.c new file mode 100644 index 0000000..14b2f62 --- /dev/null +++ b/src/nexttoward.c
@@ -0,0 +1,93 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+double FN_PROTOTYPE(nexttoward)(double x, long double y)
+{
+
+
+    UT64 checkbits;
+    long double dy = y;
+    checkbits.f64=x;
+
+    /* if x == y return y in the type of x */
+    if( x == dy )
+    {
+        return (double) dy;
+    }
+
+    /* check if the number is NaN */
+    if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+    {
+
+        __amd_handle_error(DOMAIN, ERANGE, "nexttoward", x, (double)y, x+x);
+
+
+        return x+x;
+    }
+
+    if( x == 0.0)
+    {
+        checkbits.u64 = 1;
+        if( dy > 0.0 )
+            return checkbits.f64;
+        else
+            return -checkbits.f64;
+    }
+
+
+    /* compute the next higher or lower value */
+
+    if(((x>0.0) ^ (dy>x)) == 0)
+    {
+        checkbits.u64++;
+    }
+    else
+    {
+        checkbits.u64--;
+    }
+
+    /* check if the result is NaN or Inf */
+    if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+    {
+        __amd_handle_error(DOMAIN, ERANGE, "nexttoward", x, (double)y, checkbits.f64);
+
+
+    }
+
+    return checkbits.f64;
+}
diff --git a/src/nexttowardf.c b/src/nexttowardf.c new file mode 100644 index 0000000..47b42c7 --- /dev/null +++ b/src/nexttowardf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+float FN_PROTOTYPE(nexttowardf)(float x, long double y)
+{
+
+
+    UT32 checkbits;
+    long double dy = y;
+    checkbits.f32=x;
+
+    /* if x == y return y in the type of x */
+    if( x == dy )
+    {
+        return (float) dy;
+    }
+
+    /* check if the number is NaN */
+    if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+    {
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, ERANGE, "nexttowardf", x, is_x_snan, (float) y, 0, x+x, 0);
+
+        }
+
+        return x+x;
+    }
+
+    if( x == 0.0)
+    {
+        checkbits.u32 = 1;
+        if( dy > 0.0 )
+            return checkbits.f32;
+        else
+            return -checkbits.f32;
+    }
+
+
+    /* compute the next higher or lower value */
+    if(((x>0.0F) ^ (dy>x)) == 0)
+    {
+        checkbits.u32++;
+    }
+    else
+    {
+        checkbits.u32--;
+    }
+
+    /* check if the result is NaN or Inf */
+    if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+    {
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, ERANGE, "nexttowardf", x, is_x_snan, (float) y, 0, checkbits.f32, 0);
+        }
+    }
+
+    return checkbits.f32;
+}
diff --git a/src/pow_special.c b/src/pow_special.c new file mode 100644 index 0000000..cb571d2 --- /dev/null +++ b/src/pow_special.c
@@ -0,0 +1,168 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + +#ifdef __x86_64__ + +#include <emmintrin.h> +#include <math.h> +#include <errno.h> + + + +#include "../inc/libm_util_amd.h" +#include "../inc/libm_special.h" + +// these codes and the ones in the related .S or .asm files have to match +#define POW_X_ONE_Y_SNAN 1 +#define POW_X_ZERO_Z_INF 2 +#define POW_X_NAN 3 +#define POW_Y_NAN 4 +#define POW_X_NAN_Y_NAN 5 +#define POW_X_NEG_Y_NOTINT 6 +#define POW_Z_ZERO 7 +#define POW_Z_DENORMAL 8 +#define POW_Z_INF 9 + +float _powf_special(float x, float y, float z, U32 code) +{ + switch(code) + { + case POW_X_ONE_Y_SNAN: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); + } + break; + + case POW_X_ZERO_Z_INF: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO); + __amd_handle_errorf(SING, ERANGE, "powf", x, 0, y, 0, z, 0); + } + break; + + case POW_X_NAN: + case POW_Y_NAN: + case POW_X_NAN_Y_NAN: + { +#ifdef WIN64 + unsigned int is_x_snan = 0, is_y_snan = 0, is_z_snan = 0; + UT32 xm, ym, zm; + xm.f32 = x; + ym.f32 = y; + zm.f32 = z; + if(code == POW_X_NAN) { is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); } + if(code == POW_Y_NAN) { is_y_snan = ( ((ym.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); } + if(code == POW_X_NAN_Y_NAN) { is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); + is_y_snan = ( ((ym.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); } + is_z_snan = ( ((zm.u32 & QNAN_MASK_32) == 0) ? 
1 : 0 ); + + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); + __amd_handle_errorf(DOMAIN, EDOM, "powf", x, is_x_snan, y, is_y_snan, z, is_z_snan); +#else + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#endif + } + break; + + case POW_X_NEG_Y_NOTINT: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); + __amd_handle_errorf(DOMAIN, EDOM, "powf", x, 0, y, 0, z, 0); + } + break; + + case POW_Z_ZERO: + { + _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW)); + __amd_handle_errorf(UNDERFLOW, ERANGE, "powf", x, 0, y, 0, z, 0); + } + break; + + case POW_Z_INF: + { + _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW)); + __amd_handle_errorf(OVERFLOW, ERANGE, "powf", x, 0, y, 0, z, 0); + } + break; + } + + return z; +} + +double _pow_special(double x, double y, double z, U32 code) +{ + switch(code) + { + case POW_X_ONE_Y_SNAN: + { +#ifdef WIN64 +#else + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#endif + } + break; + + case POW_X_ZERO_Z_INF: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO); + __amd_handle_error(SING, ERANGE, "pow", x, y, z); + } + break; + + case POW_X_NAN: + case POW_Y_NAN: + case POW_X_NAN_Y_NAN: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#ifdef WIN64 + __amd_handle_error(DOMAIN, EDOM, "pow", x, y, z); +#endif + } + break; + + case POW_X_NEG_Y_NOTINT: + { + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); + __amd_handle_error(DOMAIN, EDOM, "pow", x, y, z); + } + break; + + case POW_Z_ZERO: + case POW_Z_DENORMAL: + { + _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW)); + __amd_handle_error(UNDERFLOW, ERANGE, "pow", x, y, z); + } + break; + + case POW_Z_INF: + { + _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW)); + __amd_handle_error(OVERFLOW, ERANGE, "pow", x, y, z); + } + break; + } + + return z; +} + +#endif /* __x86_64__ */
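The handlers above report exceptions by ORing sticky status bits into MXCSR. The same mechanism in isolation, assuming the standard MXCSR layout in which bit 2 is the divide-by-zero status flag (the value MXCSR_ES_DIVBYZERO is presumably defined to); illustrative only:

#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
    unsigned int csr = _mm_getcsr();
    _mm_setcsr(csr | 0x0004); /* set the sticky divide-by-zero flag */
    printf("MXCSR = 0x%08x\n", _mm_getcsr());
    return 0;
}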
diff --git a/src/remainder_piby2.c b/src/remainder_piby2.c new file mode 100644 index 0000000..3f6676f --- /dev/null +++ b/src/remainder_piby2.c
@@ -0,0 +1,331 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + + +#define EXPBITS_DP64 0x7ff0000000000000 +#define EXPSHIFTBITS_DP64 52 +#define EXPBIAS_DP64 1023 +#define MANTBITS_DP64 0x000fffffffffffff +#define IMPBIT_DP64 0x0010000000000000 +#define SIGNBIT_DP64 0x8000000000000000 + + +#define GET_BITS_DP64(x, ux) \ + { \ + volatile union {double d; unsigned long long i;} _bitsy; \ + _bitsy.d = (x); \ + ux = _bitsy.i; \ + } + +#define PUT_BITS_DP64(ux, x) \ + { \ + volatile union {double d; unsigned long long i;} _bitsy; \ + _bitsy.i = (ux); \ + x = _bitsy.d; \ + } + +/* Define this to get debugging print statements activated */ +#define DEBUGGING_PRINT +#undef DEBUGGING_PRINT + + +#ifdef DEBUGGING_PRINT +#include <stdio.h> +char *d2b(int d, int bitsper, int point) +{ + static char buff[50]; + int i, j; + j = bitsper; + if (point >= 0 && point <= bitsper) + j++; + buff[j] = '\0'; + for (i = bitsper - 1; i >= 0; i--) + { + j--; + if (d % 2 == 1) + buff[j] = '1'; + else + buff[j] = '0'; + if (i == point) + { + j--; + buff[j] = '.'; + } + d /= 2; + } + return buff; +} +#endif + +/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using + extra precision, and return the result in r, rr. + Return value "region" tells how many lots of pi/2 were subtracted + from x to put it in the range [-pi/4,pi/4], mod 4. 
*/ +void __amd_remainder_piby2(double x, double *r, double *rr, int *region) +{ + + /* This method simulates multi-precision floating-point + arithmetic and is accurate for all 1 <= x < infinity */ + static const double + piby2_lead = 1.57079632679489655800e+00, /* 0x3ff921fb54442d18 */ + piby2_part1 = 1.57079631090164184570e+00, /* 0x3ff921fb50000000 */ + piby2_part2 = 1.58932547122958567343e-08, /* 0x3e5110b460000000 */ + piby2_part3 = 6.12323399573676480327e-17; /* 0x3c91a62633145c06 */ + const int bitsper = 10; + unsigned long long res[500]; + unsigned long long ux, u, carry, mask, mant, highbitsrr; + int first, last, i, rexp, xexp, resexp, ltb, determ; + double xx, t; + static unsigned long long pibits[] = + { + 0, 0, 0, 0, 0, 0, + 162, 998, 54, 915, 580, 84, 671, 777, 855, 839, + 851, 311, 448, 877, 553, 358, 316, 270, 260, 127, + 593, 398, 701, 942, 965, 390, 882, 283, 570, 265, + 221, 184, 6, 292, 750, 642, 465, 584, 463, 903, + 491, 114, 786, 617, 830, 930, 35, 381, 302, 749, + 72, 314, 412, 448, 619, 279, 894, 260, 921, 117, + 569, 525, 307, 637, 156, 529, 504, 751, 505, 160, + 945, 1022, 151, 1023, 480, 358, 15, 956, 753, 98, + 858, 41, 721, 987, 310, 507, 242, 498, 777, 733, + 244, 399, 870, 633, 510, 651, 373, 158, 940, 506, + 997, 965, 947, 833, 825, 990, 165, 164, 746, 431, + 949, 1004, 287, 565, 464, 533, 515, 193, 111, 798 + }; + + GET_BITS_DP64(x, ux); + +#ifdef DEBUGGING_PRINT + printf("On entry, x = %25.20e = %s\n", x, double2hex(&x)); +#endif + + xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64); + ux = (ux & MANTBITS_DP64) | IMPBIT_DP64; + + /* Now ux is the mantissa bit pattern of x as a long integer */ + carry = 0; + mask = 1; + mask = (mask << bitsper) - 1; + + /* Set first and last to the positions of the first + and last chunks of 2/pi that we need */ + first = xexp / bitsper; + resexp = xexp - first * bitsper; + /* 180 is the theoretical maximum number of bits (actually + 175 for IEEE double precision) that we need to extract + from the middle of 2/pi to compute the reduced argument + accurately enough for our purposes */ + last = first + 180 / bitsper; + + /* Do a long multiplication of the bits of 2/pi by the + integer mantissa */ + /* Unroll the loop. This is only correct because we know + that bitsper is fixed as 10. 
*/ + res[19] = 0; + u = pibits[last] * ux; + res[18] = u & mask; + carry = u >> bitsper; + u = pibits[last-1] * ux + carry; + res[17] = u & mask; + carry = u >> bitsper; + u = pibits[last-2] * ux + carry; + res[16] = u & mask; + carry = u >> bitsper; + u = pibits[last-3] * ux + carry; + res[15] = u & mask; + carry = u >> bitsper; + u = pibits[last-4] * ux + carry; + res[14] = u & mask; + carry = u >> bitsper; + u = pibits[last-5] * ux + carry; + res[13] = u & mask; + carry = u >> bitsper; + u = pibits[last-6] * ux + carry; + res[12] = u & mask; + carry = u >> bitsper; + u = pibits[last-7] * ux + carry; + res[11] = u & mask; + carry = u >> bitsper; + u = pibits[last-8] * ux + carry; + res[10] = u & mask; + carry = u >> bitsper; + u = pibits[last-9] * ux + carry; + res[9] = u & mask; + carry = u >> bitsper; + u = pibits[last-10] * ux + carry; + res[8] = u & mask; + carry = u >> bitsper; + u = pibits[last-11] * ux + carry; + res[7] = u & mask; + carry = u >> bitsper; + u = pibits[last-12] * ux + carry; + res[6] = u & mask; + carry = u >> bitsper; + u = pibits[last-13] * ux + carry; + res[5] = u & mask; + carry = u >> bitsper; + u = pibits[last-14] * ux + carry; + res[4] = u & mask; + carry = u >> bitsper; + u = pibits[last-15] * ux + carry; + res[3] = u & mask; + carry = u >> bitsper; + u = pibits[last-16] * ux + carry; + res[2] = u & mask; + carry = u >> bitsper; + u = pibits[last-17] * ux + carry; + res[1] = u & mask; + carry = u >> bitsper; + u = pibits[last-18] * ux + carry; + res[0] = u & mask; + +#ifdef DEBUGGING_PRINT + printf("resexp = %d\n", resexp); + printf("Significant part of x * 2/pi with binary" + " point in correct place:\n"); + for (i = 0; i <= last - first; i++) + { + if (i > 0 && i % 5 == 0) + printf("\n "); + if (i == 1) + printf("%s ", d2b((int)res[i], bitsper, resexp)); + else + printf("%s ", d2b((int)res[i], bitsper, -1)); + } + printf("\n"); +#endif + + /* Reconstruct the result */ + ltb = (int)((((res[0] << bitsper) | res[1]) + >> (bitsper - 1 - resexp)) & 7); + + /* determ says whether the fractional part is >= 0.5 */ + determ = ltb & 1; + +#ifdef DEBUGGING_PRINT + printf("ltb = %d (last two bits before binary point" + " and first bit after)\n", ltb); + printf("determ = %d (1 means need to negate because the fractional\n" + " part of x * 2/pi is greater than 0.5)\n", determ); +#endif + + i = 1; + if (determ) + { + /* The mantissa is >= 0.5. 
We want to subtract it + from 1.0 by negating all the bits */ + *region = ((ltb >> 1) + 1) & 3; + mant = 1; + mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1); + while (mant < 0x0020000000000000) + { + i++; + mant = (mant << bitsper) | (~(res[i]) & mask); + } + highbitsrr = ~(res[i + 1]) << (64 - bitsper); + } + else + { + *region = (ltb >> 1); + mant = 1; + mant = res[1] & ((mant << (bitsper - resexp)) - 1); + while (mant < 0x0020000000000000) + { + i++; + mant = (mant << bitsper) | res[i]; + } + highbitsrr = res[i + 1] << (64 - bitsper); + } + + rexp = 52 + resexp - i * bitsper; + + while (mant >= 0x0020000000000000) + { + rexp++; + highbitsrr = (highbitsrr >> 1) | ((mant & 1) << 63); + mant >>= 1; + } + +#ifdef DEBUGGING_PRINT + printf("Normalised mantissa = 0x%016lx\n", mant); + printf("High bits of rest of mantissa = 0x%016lx\n", highbitsrr); + printf("Exponent to be inserted on mantissa = rexp = %d\n", rexp); +#endif + + /* Put the result exponent rexp onto the mantissa pattern */ + u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64; + ux = (mant & MANTBITS_DP64) | u; + if (determ) + /* If we negated the mantissa we negate x too */ + ux |= SIGNBIT_DP64; + PUT_BITS_DP64(ux, x); + + /* Create the bit pattern for rr */ + highbitsrr >>= 12; /* Note this is shifted one place too far */ + u = ((unsigned long long)rexp + EXPBIAS_DP64 - 53) << EXPSHIFTBITS_DP64; + PUT_BITS_DP64(u, t); + u |= highbitsrr; + PUT_BITS_DP64(u, xx); + + /* Subtract the implicit bit we accidentally added */ + xx -= t; + /* Set the correct sign, and double to account for the + "one place too far" shift */ + if (determ) + xx *= -2.0; + else + xx *= 2.0; + +#ifdef DEBUGGING_PRINT + printf("(lead part of x*2/pi) = %25.20e = %s\n", x, double2hex(&x)); + printf("(tail part of x*2/pi) = %25.20e = %s\n", xx, double2hex(&xx)); +#endif + + /* (x,xx) is an extra-precise version of the fractional part of + x * 2 / pi. Multiply (x,xx) by pi/2 in extra precision + to get the reduced argument (r,rr). */ + { + double hx, tx, c, cc; + /* Split x into hx (head) and tx (tail) */ + GET_BITS_DP64(x, ux); + ux &= 0xfffffffff8000000; + PUT_BITS_DP64(ux, hx); + tx = x - hx; + + c = piby2_lead * x; + cc = ((((piby2_part1 * hx - c) + piby2_part1 * tx) + + piby2_part2 * hx) + piby2_part2 * tx) + + (piby2_lead * xx + piby2_part3 * x); + *r = c + cc; + *rr = (c - *r) + cc; + } + +#ifdef DEBUGGING_PRINT + printf(" (r,rr) = lead and tail parts of frac(x*2/pi) * pi/2:\n"); + printf(" r = %25.20e = %s\n", *r, double2hex(r)); + printf("rr = %25.20e = %s\n", *rr, double2hex(rr)); + printf("region = (number of pi/2 subtracted from x) mod 4 = %d\n", + *region); +#endif + return; +}
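The reason for this effort is that a single multiply-and-subtract against a double-precision pi/2 loses most of the result to cancellation once many multiples of pi/2 are removed. A scaled-down Cody-Waite style reduction using the split constants from the file above shows the idea: each part of pi/2 has enough trailing zero bits that n times it is exact, so the successive subtractions cancel without error. Illustrative only, and valid only for moderate x where n stays small enough for the products to remain exact:

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double piby2_lead  = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
    const double piby2_part1 = 1.57079631090164184570e+00; /* 0x3ff921fb50000000 */
    const double piby2_part2 = 1.58932547122958567343e-08; /* 0x3e5110b460000000 */
    const double piby2_part3 = 6.12323399573676480327e-17; /* 0x3c91a62633145c06 */

    double x = 1.0e6;
    double n = floor(x / piby2_lead);

    double naive = x - n * piby2_lead; /* one rounded product: catastrophic cancellation */
    double cw = ((x - n * piby2_part1) - n * piby2_part2) - n * piby2_part3;

    printf("naive      = %.17g\n", naive);
    printf("cody-waite = %.17g\n", cw); /* the two disagree in the trailing digits */
    return 0;
}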
diff --git a/src/remainder_piby2d2f.c b/src/remainder_piby2d2f.c new file mode 100644 index 0000000..59ed44a --- /dev/null +++ b/src/remainder_piby2d2f.c
@@ -0,0 +1,217 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + + +#define EXPBITS_DP64 0x7ff0000000000000 +#define EXPSHIFTBITS_DP64 52 +#define EXPBIAS_DP64 1023 +#define MANTBITS_DP64 0x000fffffffffffff +#define IMPBIT_DP64 0x0010000000000000 +#define SIGNBIT_DP64 0x8000000000000000 + +#define PUT_BITS_DP64(ux, x) \ + { \ + volatile union {double d; unsigned long long i;} _bitsy; \ + _bitsy.i = (ux); \ + x = _bitsy.d; \ + } + +/*Derived from static inline void __amd_remainder_piby2f_inline(unsigned long long ux, double *r, int *region) +in libm_inlines_amd.h. libm_inlines.h has the pure Windows one while libm_inlines_amd.h has the mixed one. +*/ +/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using + extra precision, and return the result in r. + Return value "region" tells how many lots of pi/2 were subtracted + from x to put it in the range [-pi/4,pi/4], mod 4. */ +void __amd_remainder_piby2d2f(unsigned long long ux, double *r, int *region) +{ + /* This method simulates multi-precision floating-point + arithmetic and is accurate for all 1 <= x < infinity */ + unsigned long long u, carry, mask, mant, highbitsrr; + double dx; + unsigned long long res[500]; + int first, last, i, rexp, xexp, resexp, ltb, determ; + static const double + piby2 = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */ + const int bitsper = 10; + static unsigned long long pibits[] = + { + 0, 0, 0, 0, 0, 0, + 162, 998, 54, 915, 580, 84, 671, 777, 855, 839, + 851, 311, 448, 877, 553, 358, 316, 270, 260, 127, + 593, 398, 701, 942, 965, 390, 882, 283, 570, 265, + 221, 184, 6, 292, 750, 642, 465, 584, 463, 903, + 491, 114, 786, 617, 830, 930, 35, 381, 302, 749, + 72, 314, 412, 448, 619, 279, 894, 260, 921, 117, + 569, 525, 307, 637, 156, 529, 504, 751, 505, 160, + 945, 1022, 151, 1023, 480, 358, 15, 956, 753, 98, + 858, 41, 721, 987, 310, 507, 242, 498, 777, 733, + 244, 399, 870, 633, 510, 651, 373, 158, 940, 506, + 997, 965, 947, 833, 825, 990, 165, 164, 746, 431, + 949, 1004, 287, 565, 464, 533, 515, 193, 111, 798 + }; + + xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64); + ux = (ux & MANTBITS_DP64) | IMPBIT_DP64; + + /* Now ux is the mantissa bit pattern of x as a long integer */ + mask = 1; + mask = (mask << bitsper) - 1; + + /* Set first and last to the positions of the first + and last chunks of 2/pi that we need */ + first = xexp / bitsper; + resexp = xexp - first * bitsper; + /* 180 is the theoretical maximum number of bits (actually + 175 for IEEE double precision) that we need to extract + from the middle of 2/pi to compute the reduced argument + accurately enough for our purposes */ + last = first + 180 / bitsper; + + /* Do a long multiplication of the bits of 2/pi by the + integer mantissa */ + /* Unroll the loop. 
This is only correct because we know + that bitsper is fixed as 10. */ + res[19] = 0; + u = pibits[last] * ux; + res[18] = u & mask; + carry = u >> bitsper; + u = pibits[last-1] * ux + carry; + res[17] = u & mask; + carry = u >> bitsper; + u = pibits[last-2] * ux + carry; + res[16] = u & mask; + carry = u >> bitsper; + u = pibits[last-3] * ux + carry; + res[15] = u & mask; + carry = u >> bitsper; + u = pibits[last-4] * ux + carry; + res[14] = u & mask; + carry = u >> bitsper; + u = pibits[last-5] * ux + carry; + res[13] = u & mask; + carry = u >> bitsper; + u = pibits[last-6] * ux + carry; + res[12] = u & mask; + carry = u >> bitsper; + u = pibits[last-7] * ux + carry; + res[11] = u & mask; + carry = u >> bitsper; + u = pibits[last-8] * ux + carry; + res[10] = u & mask; + carry = u >> bitsper; + u = pibits[last-9] * ux + carry; + res[9] = u & mask; + carry = u >> bitsper; + u = pibits[last-10] * ux + carry; + res[8] = u & mask; + carry = u >> bitsper; + u = pibits[last-11] * ux + carry; + res[7] = u & mask; + carry = u >> bitsper; + u = pibits[last-12] * ux + carry; + res[6] = u & mask; + carry = u >> bitsper; + u = pibits[last-13] * ux + carry; + res[5] = u & mask; + carry = u >> bitsper; + u = pibits[last-14] * ux + carry; + res[4] = u & mask; + carry = u >> bitsper; + u = pibits[last-15] * ux + carry; + res[3] = u & mask; + carry = u >> bitsper; + u = pibits[last-16] * ux + carry; + res[2] = u & mask; + carry = u >> bitsper; + u = pibits[last-17] * ux + carry; + res[1] = u & mask; + carry = u >> bitsper; + u = pibits[last-18] * ux + carry; + res[0] = u & mask; + + /* Reconstruct the result */ + ltb = (int)((((res[0] << bitsper) | res[1]) + >> (bitsper - 1 - resexp)) & 7); + + /* determ says whether the fractional part is >= 0.5 */ + determ = ltb & 1; + + i = 1; + if (determ) + { + /* The mantissa is >= 0.5. We want to subtract it + from 1.0 by negating all the bits */ + *region = ((ltb >> 1) + 1) & 3; + mant = 1; + mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1); + while (mant < 0x0020000000000000) + { + i++; + mant = (mant << bitsper) | (~(res[i]) & mask); + } + highbitsrr = ~(res[i + 1]) << (64 - bitsper); + } + else + { + *region = (ltb >> 1); + mant = 1; + mant = res[1] & ((mant << (bitsper - resexp)) - 1); + while (mant < 0x0020000000000000) + { + i++; + mant = (mant << bitsper) | res[i]; + } + highbitsrr = res[i + 1] << (64 - bitsper); + } + + rexp = 52 + resexp - i * bitsper; + + while (mant >= 0x0020000000000000) + { + rexp++; + highbitsrr = (highbitsrr >> 1) | ((mant & 1) << 63); + mant >>= 1; + } + + /* Put the result exponent rexp onto the mantissa pattern */ + u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64; + ux = (mant & MANTBITS_DP64) | u; + if (determ) + /* If we negated the mantissa we negate x too */ + ux |= SIGNBIT_DP64; + PUT_BITS_DP64(ux, dx); + + /* x is a double precision version of the fractional part of + x * 2 / pi. Multiply x by pi/2 in double precision + to get the reduced argument r. */ + *r = dx * piby2; + + return; +} + +void __remainder_piby2d2f(unsigned long ux, double *r, int *region) +{ + __amd_remainder_piby2d2f((unsigned long long) ux, r, region); +} +
diff --git a/src/rint.c b/src/rint.c new file mode 100644 index 0000000..770685f --- /dev/null +++ b/src/rint.c
@@ -0,0 +1,69 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+
+
+
+double FN_PROTOTYPE(rint)(double x)
+{
+
+    UT64 checkbits,val_2p52;
+    UT32 sign;
+    checkbits.f64=x;
+
+    /* Clear the sign bit and check if the value can be rounded (i.e. check if the exponent is less than 52) */
+    if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+    {
+        /* take care of NaN or Inf */
+        if((checkbits.u32[1] & 0x7ff00000)== 0x7ff00000)
+            return x+x;
+        else
+            return x;
+    }
+
+    sign.u32 = checkbits.u32[1] & 0x80000000;
+    val_2p52.u32[1] = sign.u32 | 0x43300000;
+    val_2p52.u32[0] = 0;
+
+    /* Add and subtract 2^52 to round the number according to the current rounding direction */
+    val_2p52.f64 = (x + val_2p52.f64) - val_2p52.f64;
+
+    /* This extra step takes care of denormals and the various rounding modes */
+    val_2p52.u32[1] = ((val_2p52.u32[1] << 1) >> 1) | sign.u32;
+
+    if(x!=val_2p52.f64)
+    {
+        /* Raise a floating-point inexact exception if the result differs in value from the argument */
+        checkbits.u64 = QNANBITPATT_DP64;
+        checkbits.f64 = checkbits.f64 + checkbits.f64; /* raise the inexact exception by adding two NaN numbers */
+    }
+
+
+    return (val_2p52.f64);
+}
+
+
+
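rint is specified to raise FE_INEXACT whenever it changes the value, while nearbyint performs the same rounding silently. A quick probe of that difference, assuming a conforming <fenv.h> implementation (the expected outputs hold for a conforming libm; illustrative only):

#include <stdio.h>
#include <math.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);
    (void)rint(2.5);
    printf("rint raised inexact: %d\n", fetestexcept(FE_INEXACT) != 0); /* 1 */

    feclearexcept(FE_ALL_EXCEPT);
    (void)nearbyint(2.5);
    printf("nearbyint raised inexact: %d\n", fetestexcept(FE_INEXACT) != 0); /* 0 */
    return 0;
}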
diff --git a/src/rintf.c b/src/rintf.c new file mode 100644 index 0000000..e048c11 --- /dev/null +++ b/src/rintf.c
@@ -0,0 +1,65 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+
+
+
+float FN_PROTOTYPE(rintf)(float x)
+{
+
+    UT32 checkbits,sign,val_2p23;
+    checkbits.f32=x;
+
+    /* Clear the sign bit and check if the value can be rounded */
+    if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+    {
+        /* The number exceeds the representable range and could also be NaN or Inf */
+        /* take care of NaN or Inf */
+        if((checkbits.u32 & 0x7f800000)== 0x7f800000)
+            return x+x;
+        else
+            return x;
+    }
+
+    sign.u32 = checkbits.u32 & 0x80000000;
+    val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+    /* Add and subtract 2^23 to round the number according to the current rounding direction */
+    val_2p23.f32 = ((x + val_2p23.f32) - val_2p23.f32);
+
+    /* This extra step takes care of denormals and the various rounding modes */
+    val_2p23.u32 = ((val_2p23.u32 << 1) >> 1) | sign.u32;
+
+    if (val_2p23.f32 != x)
+    {
+        /* Raise a floating-point inexact exception if the result differs in value from the argument */
+        checkbits.u32 = 0xFFC00000;
+        checkbits.f32 = checkbits.f32 + checkbits.f32; /* raise the inexact exception by adding two NaN numbers */
+    }
+
+
+    return val_2p23.f32;
+}
diff --git a/src/roundf.c b/src/roundf.c new file mode 100644 index 0000000..596c381 --- /dev/null +++ b/src/roundf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(roundf)(float f)
+{
+    UT32 u32f, u32Temp;
+    U32 u32sign, u32exp, u32mantissa;
+    int intexp; /* needs to be signed */
+    u32f.f32 = f;
+    u32sign = u32f.u32 & SIGNBIT_SP32;
+    if ((u32f.u32 & 0x7F800000) == 0x7F800000)
+    {
+        /* Return a quiet NaN: quiet the signalling NaN */
+        if(!((u32f.u32 & MANTBITS_SP32) == 0))
+            u32f.u32 |= QNAN_MASK_32;
+        /* else the number is infinity */
+        /* Raise a range or domain error */
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = f;
+            is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+            __amd_handle_errorf(DOMAIN, EDOM, "roundf", f, is_x_snan, 0.0F, 0, u32f.f32, 0);
+        }
+
+
+        return u32f.f32;
+    }
+    /* Get the exponent of the input */
+    intexp = (u32f.u32 & 0x7f800000) >> 23;
+    intexp -= 0x7F;
+    /* If the exponent is greater than 22, the number is already an integer */
+    if (intexp > 22)
+        return f;
+    if (intexp < 0)
+    {
+        u32Temp.f32 = f;
+        u32Temp.u32 &= 0x7FFFFFFF;
+        /* Add the large number (2^23 + 1) = 8388609.0F to force the
+           fraction bits to be rounded away */
+        u32Temp.f32 = (u32Temp.f32 + 8388609.0F);
+        /* Subtract the large number back out */
+        u32Temp.f32 -= 8388609.0F;
+        if (u32sign)
+            u32Temp.u32 |= 0x80000000;
+        return u32Temp.f32;
+    }
+    else
+    {
+        u32f.u32 &= 0x7FFFFFFF;
+        u32f.f32 += 0.5F;
+        u32exp = u32f.u32 & 0x7F800000;
+        /* Right shift then left shift to discard the fraction
+           bits */
+        u32mantissa = (u32f.u32 & MANTBITS_SP32) >> (23 - intexp);
+        u32mantissa = u32mantissa << (23 - intexp);
+        u32Temp.u32 = u32sign | u32exp | u32mantissa;
+        return (u32Temp.f32);
+    }
+}
diff --git a/src/scalbln.c b/src/scalbln.c new file mode 100644 index 0000000..51499d8 --- /dev/null +++ b/src/scalbln.c
@@ -0,0 +1,119 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+double FN_PROTOTYPE(scalbln)(double x, long int n)
+{
+    UT64 val;
+    unsigned int sign;
+    int exponent;
+    val.f64 = x;
+    sign = val.u32[1] & 0x80000000;
+    val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */
+
+    if((val.u32[1] & 0x7ff00000)== 0x7ff00000) /* x = NaN or x = +-Inf */
+        return x+x;
+
+    if((val.u64 == 0x0000000000000000) || (n==0))
+        return x; /* x = +-0 or n = 0 */
+
+    exponent = val.u32[1] >> 20; /* get the exponent */
+
+    if(exponent == 0) /* x is denormal */
+    {
+        val.f64 = val.f64 * VAL_2PMULTIPLIER_DP; /* multiply by 2^53 to bring it into the normal range */
+        exponent = val.u32[1] >> 20; /* get the exponent */
+        exponent = exponent + n - MULTIPLIER_DP;
+        if(exponent < -MULTIPLIER_DP) /* underflow */
+        {
+            val.u32[1] = sign | 0x00000000;
+            val.u32[0] = 0x00000000;
+
+            __amd_handle_error(UNDERFLOW, ERANGE, "scalbln", x, (double)n, val.f64);
+
+            return val.f64;
+        }
+        if(exponent > 2046) /* overflow */
+        {
+            val.u32[1] = sign | 0x7ff00000;
+            val.u32[0] = 0x00000000;
+
+            __amd_handle_error(OVERFLOW, ERANGE, "scalbln", x, (double)n, val.f64);
+
+
+            return val.f64;
+        }
+
+        exponent += MULTIPLIER_DP;
+        val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+        val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+        return val.f64;
+    }
+
+    exponent += n;
+
+    if(exponent < -MULTIPLIER_DP) /* underflow */
+    {
+        val.u32[1] = sign | 0x00000000;
+        val.u32[0] = 0x00000000;
+
+        __amd_handle_error(UNDERFLOW, ERANGE, "scalbln", x, (double)n, val.f64);
+
+        return val.f64;
+    }
+
+    if(exponent < 1) /* x is normal but the output is denormal */
+    {
+        exponent += MULTIPLIER_DP;
+        val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+        val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+        return val.f64;
+    }
+
+    if(exponent > 2046) /* overflow */
+    {
+        val.u32[1] = sign | 0x7ff00000;
+        val.u32[0] = 0x00000000;
+
+        __amd_handle_error(OVERFLOW, ERANGE, "scalbln", x, (double)n, val.f64);
+
+
+        return val.f64;
+    }
+
+    val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+    return val.f64;
+}
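Because scalbln only rewrites the exponent field (or rescales through a denormal), the scaling is exact whenever the result is representable; overflow and underflow fall through to the error paths above. Usage sketch (illustrative only; link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    printf("%g\n", scalbln(3.0, 4));     /* 48: exact, no rounding */
    printf("%g\n", scalbln(1.0, -1074)); /* 4.94e-324, the smallest subnormal */
    printf("%g\n", scalbln(1.0, 2000));  /* inf, with overflow/ERANGE reported */
    return 0;
}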
diff --git a/src/scalblnf.c b/src/scalblnf.c new file mode 100644 index 0000000..cc627bb --- /dev/null +++ b/src/scalblnf.c
@@ -0,0 +1,133 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(scalblnf)(float x, long int n)
+{
+    UT32 val;
+    unsigned int sign;
+    int exponent;
+    val.f32 = x;
+    sign = val.u32 & 0x80000000;
+    val.u32 = val.u32 & 0x7fffffff; /* remove the sign bit */
+
+    if ((val.u32 & 0x7f800000) == 0x7f800000) /* x = NaN or x = +-inf */
+        return x + x;
+
+    if ((val.u32 == 0x00000000) || (n == 0)) /* x = +-0 or n = 0 */
+        return x;
+
+    exponent = val.u32 >> 23; /* get the biased exponent */
+
+    if (exponent == 0) /* x is denormal */
+    {
+        val.f32 = val.f32 * VAL_2PMULTIPLIER_SP; /* multiply by 2^24 to bring it into the normal range */
+        exponent = (val.u32 >> 23); /* get the biased exponent */
+        exponent = exponent + n - MULTIPLIER_SP;
+        if (exponent < -MULTIPLIER_SP) /* underflow */
+        {
+            val.u32 = sign | 0x00000000;
+
+            {
+                unsigned int is_x_snan;
+                UT32 xm; xm.f32 = x;
+                is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+                __amd_handle_errorf(UNDERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0, val.f32, 0);
+            }
+
+            return val.f32;
+        }
+        if (exponent > 254) /* overflow */
+        {
+            val.u32 = sign | 0x7f800000;
+
+            {
+                unsigned int is_x_snan;
+                UT32 xm; xm.f32 = x;
+                is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+                __amd_handle_errorf(OVERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0, val.f32, 0);
+            }
+
+            return val.f32;
+        }
+
+        exponent += MULTIPLIER_SP;
+        val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+        val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+        return val.f32;
+    }
+
+    exponent += n;
+
+    if (exponent < -MULTIPLIER_SP) /* underflow */
+    {
+        val.u32 = sign | 0x00000000;
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+            __amd_handle_errorf(UNDERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0, val.f32, 0);
+        }
+
+        return val.f32;
+    }
+
+    if (exponent < 1) /* x is normal but the output is denormal */
+    {
+        exponent += MULTIPLIER_SP;
+        val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+        val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+        return val.f32;
+    }
+
+    if (exponent > 254) /* overflow */
+    {
+        val.u32 = sign | 0x7f800000;
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+            __amd_handle_errorf(OVERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0, val.f32, 0);
+        }
+
+        return val.f32;
+    }
+
+    val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff); /* x is normal and the output is normal */
+    return val.f32;
+}
diff --git a/src/scalbn.c b/src/scalbn.c new file mode 100644 index 0000000..facb718 --- /dev/null +++ b/src/scalbn.c
@@ -0,0 +1,117 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+
+double FN_PROTOTYPE(scalbn)(double x, int n)
+{
+    UT64 val;
+    unsigned int sign;
+    int exponent;
+    val.f64 = x;
+    sign = val.u32[1] & 0x80000000;
+    val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */
+
+    if ((val.u32[1] & 0x7ff00000) == 0x7ff00000) /* x = NaN or x = +-inf */
+        return x + x;
+
+    if ((val.u64 == 0x0000000000000000) || (n == 0))
+        return x; /* x = +-0 or n = 0 */
+
+    exponent = val.u32[1] >> 20; /* get the biased exponent */
+
+    if (exponent == 0) /* x is denormal */
+    {
+        val.f64 = val.f64 * VAL_2PMULTIPLIER_DP; /* multiply by 2^53 to bring it into the normal range */
+        exponent = val.u32[1] >> 20; /* get the biased exponent */
+        exponent = exponent + n - MULTIPLIER_DP;
+        if (exponent < -MULTIPLIER_DP) /* underflow */
+        {
+            val.u32[1] = sign | 0x00000000;
+            val.u32[0] = 0x00000000;
+            __amd_handle_error(UNDERFLOW, ERANGE, "scalbn", x, (double)n, val.f64);
+
+            return val.f64;
+        }
+        if (exponent > 2046) /* overflow */
+        {
+            val.u32[1] = sign | 0x7ff00000;
+            val.u32[0] = 0x00000000;
+
+            __amd_handle_error(OVERFLOW, ERANGE, "scalbn", x, (double)n, val.f64);
+
+            return val.f64;
+        }
+
+        exponent += MULTIPLIER_DP;
+        val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+        val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+        return val.f64;
+    }
+
+    exponent += n;
+
+    if (exponent < -MULTIPLIER_DP) /* underflow */
+    {
+        val.u32[1] = sign | 0x00000000;
+        val.u32[0] = 0x00000000;
+
+        __amd_handle_error(UNDERFLOW, ERANGE, "scalbn", x, (double)n, val.f64);
+
+        return val.f64;
+    }
+
+    if (exponent < 1) /* x is normal but the output is denormal */
+    {
+        exponent += MULTIPLIER_DP;
+        val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+        val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+        return val.f64;
+    }
+
+    if (exponent > 2046) /* overflow */
+    {
+        val.u32[1] = sign | 0x7ff00000;
+        val.u32[0] = 0x00000000;
+
+        __amd_handle_error(OVERFLOW, ERANGE, "scalbn", x, (double)n, val.f64);
+
+        return val.f64;
+    }
+
+    val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+    return val.f64;
+}
diff --git a/src/scalbnf.c b/src/scalbnf.c new file mode 100644 index 0000000..1477fe1 --- /dev/null +++ b/src/scalbnf.c
@@ -0,0 +1,138 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(scalbnf)(float x, int n)
+{
+    UT32 val;
+    unsigned int sign;
+    int exponent;
+    val.f32 = x;
+    sign = val.u32 & 0x80000000;
+    val.u32 = val.u32 & 0x7fffffff; /* remove the sign bit */
+
+    if ((val.u32 & 0x7f800000) == 0x7f800000) /* x = NaN or x = +-inf */
+        return x + x;
+
+    if ((val.u32 == 0x00000000) || (n == 0)) /* x = +-0 or n = 0 */
+        return x;
+
+    exponent = val.u32 >> 23; /* get the biased exponent */
+
+    if (exponent == 0) /* x is denormal */
+    {
+        val.f32 = val.f32 * VAL_2PMULTIPLIER_SP; /* multiply by 2^24 to bring it into the normal range */
+        exponent = (val.u32 >> 23); /* get the biased exponent */
+        exponent = exponent + n - MULTIPLIER_SP;
+        if (exponent < -MULTIPLIER_SP) /* underflow */
+        {
+            val.u32 = sign | 0x00000000;
+
+            {
+                unsigned int is_x_snan;
+                UT32 xm; xm.f32 = x;
+                is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+                __amd_handle_errorf(UNDERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float)n, 0, val.f32, 0);
+            }
+
+            return val.f32;
+        }
+        if (exponent > 254) /* overflow */
+        {
+            val.u32 = sign | 0x7f800000;
+
+            {
+                unsigned int is_x_snan;
+                UT32 xm; xm.f32 = x;
+                is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+                __amd_handle_errorf(OVERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float)n, 0, val.f32, 0);
+            }
+
+            return val.f32;
+        }
+
+        exponent += MULTIPLIER_SP;
+        val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+        val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+        return val.f32;
+    }
+
+    exponent += n;
+
+    if (exponent < -MULTIPLIER_SP) /* underflow */
+    {
+        val.u32 = sign | 0x00000000;
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+            __amd_handle_errorf(UNDERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float)n, 0, val.f32, 0);
+        }
+
+        return val.f32;
+    }
+
+    if (exponent < 1) /* x is normal but the output is denormal */
+    {
+        exponent += MULTIPLIER_SP;
+        val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+        val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+        return val.f32;
+    }
+
+    if (exponent > 254) /* overflow */
+    {
+        val.u32 = sign | 0x7f800000;
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+            __amd_handle_errorf(OVERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float)n, 0, val.f32, 0);
+        }
+
+        return val.f32;
+    }
+
+    val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff); /* x is normal and the output is normal */
+    return val.f32;
+}
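All four scalb* variants treat subnormal inputs the same way: pre-scale by 2^k so the value becomes normal, do the exponent arithmetic in the normal range, then scale back by 2^-k. The sketch below isolates that pre-scaling step for binary32, with k = 24 taken from the "multiply by 2^24" comment above (an assumption, since the actual MULTIPLIER_SP value lives in the library headers); the ldexpf call stands in for the direct exponent-field rewrite the library performs. Link with -lm on Unix.

#include <stdio.h>
#include <math.h>

/* Two-step ldexpf-style scaling of a subnormal: the intermediate
 * value is always normal, so no precision is lost to the subnormal
 * representation while the exponent is adjusted. */
static float scale_subnormal(float x, int n)
{
    float normal = x * 0x1p24f;    /* now in the normal range */
    return ldexpf(normal, n - 24); /* undo the pre-scaling     */
}

int main(void)
{
    float tiny = 0x1p-140f;                    /* subnormal in binary32 */
    printf("%a\n", scale_subnormal(tiny, 20)); /* 0x1p-120 */
    return 0;
}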
diff --git a/src/sincos_special.c b/src/sincos_special.c new file mode 100644 index 0000000..c349d10 --- /dev/null +++ b/src/sincos_special.c
@@ -0,0 +1,151 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include <emmintrin.h> +#include <math.h> +#include <errno.h> + + +#include "../inc/libm_util_amd.h" +#include "../inc/libm_special.h" + +double _sin_cos_special(double x, const char *name) +{ + UT64 xu; + unsigned int is_snan; + + xu.f64 = x; + + if((xu.u64 & EXPBITS_DP64) == EXPBITS_DP64) + { + // x is Inf or NaN + if((xu.u64 & MANTBITS_DP64) == 0x0) + { + // x is Inf + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#ifdef WIN64 + xu.u64 = INDEFBITPATT_DP64; + __amd_handle_error(DOMAIN, EDOM, name, x, 0, xu.f64); +#else + xu.u64 = QNANBITPATT_DP64; + name = *(&name); // dummy statement to avoid warning +#endif + } + else { + // x is NaN + is_snan = (((xu.u64 & QNAN_MASK_64) == QNAN_MASK_64) ? 0 : 1); + if(is_snan){ + xu.u64 |= QNAN_MASK_64; +#ifdef WIN64 +#else + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#endif + } +#ifdef WIN64 + __amd_handle_error(DOMAIN, EDOM, name, x, 0, xu.f64); +#endif + } + + } + + return xu.f64; +} + +float _sinf_cosf_special(float x, const char *name) +{ + UT32 xu; + unsigned int is_snan; + + xu.f32 = x; + + if((xu.u32 & EXPBITS_SP32) == EXPBITS_SP32) + { + // x is Inf or NaN + if((xu.u32 & MANTBITS_SP32) == 0x0) + { + // x is Inf + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); +#ifdef WIN64 + xu.u32 = INDEFBITPATT_SP32; + __amd_handle_errorf(DOMAIN, EDOM, name, x, 0, 0.0f, 0, xu.f32, 0); +#else + xu.u32 = QNANBITPATT_SP32; + name = *(&name); // dummy statement to avoid warning +#endif + } + else { + // x is NaN + is_snan = (((xu.u32 & QNAN_MASK_32) == QNAN_MASK_32) ? 0 : 1); + if(is_snan) { + xu.u32 |= QNAN_MASK_32; + _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID); + } +#ifdef WIN64 + __amd_handle_errorf(DOMAIN, EDOM, name, x, is_snan, 0.0f, 0, xu.f32, 0); +#endif + } + + } + + return xu.f32; +} + +float _sinf_special(float x) +{ + return _sinf_cosf_special(x, "sinf"); +} + +double _sin_special(double x) +{ + return _sin_cos_special(x, "sin"); +} + +float _cosf_special(float x) +{ + return _sinf_cosf_special(x, "cosf"); +} + +double _cos_special(double x) +{ + return _sin_cos_special(x, "cos"); +} + +void _sincosf_special(float x, float *sy, float *cy) +{ + float xu = _sinf_cosf_special(x, "sincosf"); + + *sy = xu; + *cy = xu; + + return; +} + +void _sincos_special(double x, double *sy, double *cy) +{ + double xu = _sin_cos_special(x, "sincos"); + + *sy = xu; + *cy = xu; + + return; +}
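Two conventions in these special-case handlers are worth making explicit: a NaN is signaling when its quiet bit (QNAN_MASK_*) is clear, and the handlers quiet it by OR-ing that bit in while raising the invalid exception through MXCSR. A standalone sketch of the quieting step, using the standard binary32 quiet-bit value 0x00400000 directly as an assumption in place of the library constant:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Quiet a signaling NaN by setting the binary32 quiet bit, as
 * _sinf_cosf_special does with QNAN_MASK_32. Non-NaNs pass through
 * unchanged; for a qNaN the OR is a no-op. */
static float quiet_nan(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    if ((u & 0x7f800000u) == 0x7f800000u && (u & 0x007fffffu) != 0u)
        u |= 0x00400000u; /* sNaN -> qNaN */
    memcpy(&x, &u, sizeof u);
    return x;
}

int main(void)
{
    uint32_t bits = 0x7f800001u; /* a signaling-NaN bit pattern */
    float f;
    memcpy(&f, &bits, sizeof f);
    /* Whether an sNaN survives by-value passing is ABI-dependent;
     * x86-64 SSE register passing preserves the payload. */
    printf("%f\n", quiet_nan(f)); /* prints nan */
    return 0;
}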
diff --git a/src/sinh.c b/src/sinh.c new file mode 100644 index 0000000..f22fee4 --- /dev/null +++ b/src/sinh.c
@@ -0,0 +1,371 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_SPLITEXP +#define USE_SCALEDOUBLE_1 +#define USE_SCALEDOUBLE_2 +#define USE_INFINITY_WITH_FLAGS +#define USE_VAL_WITH_FLAGS +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_HANDLE_ERROR +#undef USE_SPLITEXP +#undef USE_SCALEDOUBLE_1 +#undef USE_SCALEDOUBLE_2 +#undef USE_INFINITY_WITH_FLAGS +#undef USE_VAL_WITH_FLAGS + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS + +/* Deal with errno for out-of-range result */ +static inline double retval_errno_erange(double x, int xneg) +{ + struct exception exc; + exc.arg1 = x; + exc.arg2 = x; + exc.type = OVERFLOW; + exc.name = (char *)"sinh"; + if (_LIB_VERSION == _SVID_) + { + if (xneg) + exc.retval = -HUGE; + else + exc.retval = HUGE; + } + else + { + if (xneg) + exc.retval = -infinity_with_flags(AMD_F_OVERFLOW); + else + exc.retval = infinity_with_flags(AMD_F_OVERFLOW); + } + if (_LIB_VERSION == _POSIX_) + __set_errno(ERANGE); + else if (!matherr(&exc)) + __set_errno(ERANGE); + return exc.retval; +} +#endif + +double FN_PROTOTYPE(sinh)(double x) +{ + /* + After dealing with special cases the computation is split into + regions as follows: + + abs(x) >= max_sinh_arg: + sinh(x) = sign(x)*Inf + + abs(x) >= small_threshold: + sinh(x) = sign(x)*exp(abs(x))/2 computed using the + splitexp and scaleDouble functions as for exp_amd(). + + abs(x) < small_threshold: + compute p = exp(y) - 1 and then z = 0.5*(p+(p/(p+1.0))) + sinh(x) is then sign(x)*z. */ + + static const double + max_sinh_arg = 7.10475860073943977113e+02, /* 0x408633ce8fb9f87e */ + thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */ + log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */ + log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */ + small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889; + /* (8*BASEDIGITS_DP64*log10of2) ' exp(-x) insignificant c.f. exp(x) */ + + /* Lead and tail tabulated values of sinh(i) and cosh(i) + for i = 0,...,36. The lead part has 26 leading bits. 
*/ + + static const double sinh_lead[37] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.17520117759704589844e+00, /* 0x3ff2cd9fc0000000 */ + 3.62686038017272949219e+00, /* 0x400d03cf60000000 */ + 1.00178747177124023438e+01, /* 0x40240926e0000000 */ + 2.72899169921875000000e+01, /* 0x403b4a3800000000 */ + 7.42032089233398437500e+01, /* 0x40528d0160000000 */ + 2.01713153839111328125e+02, /* 0x406936d228000000 */ + 5.48316116333007812500e+02, /* 0x4081228768000000 */ + 1.49047882080078125000e+03, /* 0x409749ea50000000 */ + 4.05154187011718750000e+03, /* 0x40afa71570000000 */ + 1.10132326660156250000e+04, /* 0x40c5829dc8000000 */ + 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */ + 8.13773945312500000000e+04, /* 0x40f3de1650000000 */ + 2.21206695312500000000e+05, /* 0x410b00b590000000 */ + 6.01302140625000000000e+05, /* 0x412259ac48000000 */ + 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */ + 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */ + 1.20774762500000000000e+07, /* 0x4167093488000000 */ + 3.28299845000000000000e+07, /* 0x417f4f2208000000 */ + 8.92411500000000000000e+07, /* 0x419546d8f8000000 */ + 2.42582596000000000000e+08, /* 0x41aceb0888000000 */ + 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */ + 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */ + 4.87240166400000000000e+09, /* 0x41f226af30000000 */ + 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */ + 3.60024494080000000000e+10, /* 0x4220c3d390000000 */ + 9.78648043520000000000e+10, /* 0x4236c93268000000 */ + 2.66024116224000000000e+11, /* 0x424ef822f0000000 */ + 7.23128516608000000000e+11, /* 0x42650bba30000000 */ + 1.96566712320000000000e+12, /* 0x427c9aae40000000 */ + 5.34323724288000000000e+12, /* 0x4293704708000000 */ + 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */ + 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */ + 1.07321789251584000000e+14, /* 0x42d866f348000000 */ + 2.91730863685632000000e+14, /* 0x42f0953e28000000 */ + 7.93006722514944000000e+14, /* 0x430689e220000000 */ + 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */ + + static const double sinh_tail[37] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.60467555584448807892e-08, /* 0x3e513ae6096a0092 */ + 2.76742892754807136947e-08, /* 0x3e5db70cfb79a640 */ + 2.09697499555224576530e-07, /* 0x3e8c2526b66dc067 */ + 2.04940252448908240062e-07, /* 0x3e8b81b18647f380 */ + 1.65444891522700935932e-06, /* 0x3ebbc1cdd1e1eb08 */ + 3.53116789999998198721e-06, /* 0x3ecd9f201534fb09 */ + 6.94023870987375490695e-06, /* 0x3edd1c064a4e9954 */ + 4.98876893611587449271e-06, /* 0x3ed4eca65d06ea74 */ + 3.19656024605152215752e-05, /* 0x3f00c259bcc0ecc5 */ + 2.08687768377236501204e-04, /* 0x3f2b5a6647cf9016 */ + 4.84668088325403796299e-05, /* 0x3f09691adefb0870 */ + 1.17517985422733832468e-03, /* 0x3f53410fc29cde38 */ + 6.90830086959560562415e-04, /* 0x3f46a31a50b6fb3c */ + 1.45697262451506548420e-03, /* 0x3f57defc71805c40 */ + 2.99859023684906737806e-02, /* 0x3f9eb49fd80e0bab */ + 1.02538800507941396667e-02, /* 0x3f84fffc7bcd5920 */ + 1.26787628407699110022e-01, /* 0x3fc03a93b6c63435 */ + 6.86652479544033744752e-02, /* 0x3fb1940bb255fd1c */ + 4.81593627621056619148e-01, /* 0x3fded26e14260b50 */ + 1.70489513795397629181e+00, /* 0x3ffb47401fc9f2a2 */ + 1.12416073482258713767e+01, /* 0x40267bb3f55634f1 */ + 7.06579578070110514432e+00, /* 0x401c435ff8194ddc */ + 5.91244512999659974639e+01, /* 0x404d8fee052ba63a */ + 1.68921736147050694399e+02, /* 0x40651d7edccde3f6 */ + 2.60692936262073658327e+02, /* 0x40704b1644557d1a */ + 
3.62419382134885609048e+02, /* 0x4076a6b5ca0a9dc4 */ + 4.07689930834187271103e+03, /* 0x40afd9cc72249aba */ + 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */ + 2.53720210371943067003e+04, /* 0x40d8c70158ac6363 */ + 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */ + 1.81871712615542812273e+05, /* 0x4106337db36fc718 */ + 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */ + 6.41374032312148716301e+05, /* 0x412392bc108b37cc */ + 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */ + 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */ + 7.63580561355670914054e+06}; /* 0x415d20d76744835c */ + + static const double cosh_lead[37] = { + 1.00000000000000000000e+00, /* 0x3ff0000000000000 */ + 1.54308062791824340820e+00, /* 0x3ff8b07550000000 */ + 3.76219564676284790039e+00, /* 0x400e18fa08000000 */ + 1.00676617622375488281e+01, /* 0x402422a490000000 */ + 2.73082327842712402344e+01, /* 0x403b4ee858000000 */ + 7.42099475860595703125e+01, /* 0x40528d6fc8000000 */ + 2.01715633392333984375e+02, /* 0x406936e678000000 */ + 5.48317031860351562500e+02, /* 0x4081228948000000 */ + 1.49047915649414062500e+03, /* 0x409749eaa8000000 */ + 4.05154199218750000000e+03, /* 0x40afa71580000000 */ + 1.10132329101562500000e+04, /* 0x40c5829dd0000000 */ + 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */ + 8.13773945312500000000e+04, /* 0x40f3de1650000000 */ + 2.21206695312500000000e+05, /* 0x410b00b590000000 */ + 6.01302140625000000000e+05, /* 0x412259ac48000000 */ + 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */ + 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */ + 1.20774762500000000000e+07, /* 0x4167093488000000 */ + 3.28299845000000000000e+07, /* 0x417f4f2208000000 */ + 8.92411500000000000000e+07, /* 0x419546d8f8000000 */ + 2.42582596000000000000e+08, /* 0x41aceb0888000000 */ + 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */ + 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */ + 4.87240166400000000000e+09, /* 0x41f226af30000000 */ + 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */ + 3.60024494080000000000e+10, /* 0x4220c3d390000000 */ + 9.78648043520000000000e+10, /* 0x4236c93268000000 */ + 2.66024116224000000000e+11, /* 0x424ef822f0000000 */ + 7.23128516608000000000e+11, /* 0x42650bba30000000 */ + 1.96566712320000000000e+12, /* 0x427c9aae40000000 */ + 5.34323724288000000000e+12, /* 0x4293704708000000 */ + 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */ + 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */ + 1.07321789251584000000e+14, /* 0x42d866f348000000 */ + 2.91730863685632000000e+14, /* 0x42f0953e28000000 */ + 7.93006722514944000000e+14, /* 0x430689e220000000 */ + 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */ + + static const double cosh_tail[37] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 6.89700037027478056904e-09, /* 0x3e3d9f5504c2bd28 */ + 4.43207835591715833630e-08, /* 0x3e67cb66f0a4c9fd */ + 2.33540217013828929694e-07, /* 0x3e8f58617928e588 */ + 5.17452463948269748331e-08, /* 0x3e6bc7d000c38d48 */ + 9.38728274131605919153e-07, /* 0x3eaf7f9d4e329998 */ + 2.73012191010840495544e-06, /* 0x3ec6e6e464885269 */ + 3.29486051438996307950e-06, /* 0x3ecba3a8b946c154 */ + 4.75803746362771416375e-06, /* 0x3ed3f4e76110d5a4 */ + 3.33050940471947692369e-05, /* 0x3f017622515a3e2b */ + 9.94707313972136215365e-06, /* 0x3ee4dc4b528af3d0 */ + 6.51685096227860253398e-05, /* 0x3f11156278615e10 */ + 1.18132406658066663359e-03, /* 0x3f535ad50ed821f5 */ + 6.93090416366541877541e-04, /* 0x3f46b61055f2935c */ + 1.45780415323416845386e-03, /* 0x3f57e2794a601240 */ + 
2.99862082708111758744e-02, /* 0x3f9eb4b45f6aadd3 */ + 1.02539925859688602072e-02, /* 0x3f85000b967b3698 */ + 1.26787669807076286421e-01, /* 0x3fc03a940fadc092 */ + 6.86652631843830962843e-02, /* 0x3fb1940bf3bf874c */ + 4.81593633223853068159e-01, /* 0x3fded26e1a2a2110 */ + 1.70489514001513020602e+00, /* 0x3ffb4740205796d6 */ + 1.12416073489841270572e+01, /* 0x40267bb3f55cb85d */ + 7.06579578098005001152e+00, /* 0x401c435ff81e18ac */ + 5.91244513000686140458e+01, /* 0x404d8fee052bdea4 */ + 1.68921736147088438429e+02, /* 0x40651d7edccde926 */ + 2.60692936262087528121e+02, /* 0x40704b1644557e0e */ + 3.62419382134890611269e+02, /* 0x4076a6b5ca0a9e1c */ + 4.07689930834187453002e+03, /* 0x40afd9cc72249abe */ + 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */ + 2.53720210371943103382e+04, /* 0x40d8c70158ac6364 */ + 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */ + 1.81871712615542812273e+05, /* 0x4106337db36fc718 */ + 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */ + 6.41374032312148716301e+05, /* 0x412392bc108b37cc */ + 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */ + 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */ + 7.63580561355670914054e+06}; /* 0x415d20d76744835c */ + + unsigned long long ux, aux, xneg; + double y, z, z1, z2; + int m; + + /* Special cases */ + + GET_BITS_DP64(x, ux); + aux = ux & ~SIGNBIT_DP64; + if (aux < 0x3e30000000000000) /* |x| small enough that sinh(x) = x */ + { + if (aux == 0) + /* with no inexact */ + return x; + else + return val_with_flags(x, AMD_F_INEXACT); + } + else if (aux >= 0x7ff0000000000000) /* |x| is NaN or Inf */ + { + return x + x; + } + + + xneg = (aux != ux); + + y = x; + if (xneg) y = -x; + + if (y >= max_sinh_arg) + { + /* Return +/-infinity with overflow flag */ + +#ifdef WINDOWS + if (xneg) + return handle_error("sinh", NINFBITPATT_DP64, _OVERFLOW, + AMD_F_OVERFLOW, EDOM, x, 0.0F); + else + return handle_error("sinh", PINFBITPATT_DP64, _OVERFLOW, + AMD_F_OVERFLOW, ERANGE, x, 0.0F); +#else + return retval_errno_erange(x, xneg); +#endif + } + else if (y >= small_threshold) + { + /* In this range y is large enough so that + the negative exponential is negligible, + so sinh(y) is approximated by sign(x)*exp(y)/2. The + code below is an inlined version of that from + exp() with two changes (it operates on + y instead of x, and the division by 2 is + done by reducing m by 1). */ + + splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead, + log2_by_32_tail, &m, &z1, &z2); + m -= 1; + + if (m >= EMIN_DP64 && m <= EMAX_DP64) + z = scaleDouble_1((z1+z2),m); + else + z = scaleDouble_2((z1+z2),m); + } + else + { + /* In this range we find the integer part y0 of y + and the increment dy = y - y0. We then compute + + z = sinh(y) = sinh(y0)cosh(dy) + cosh(y0)sinh(dy) + + where sinh(y0) and cosh(y0) are tabulated above. 
*/ + + int ind; + double dy, dy2, sdy, cdy, sdy1, sdy2; + + ind = (int)y; + dy = y - ind; + + dy2 = dy*dy; + sdy = dy*dy2*(0.166666666666666667013899e0 + + (0.833333333333329931873097e-2 + + (0.198412698413242405162014e-3 + + (0.275573191913636406057211e-5 + + (0.250521176994133472333666e-7 + + (0.160576793121939886190847e-9 + + 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2); + + cdy = dy2*(0.500000000000000005911074e0 + + (0.416666666666660876512776e-1 + + (0.138888888889814854814536e-2 + + (0.248015872460622433115785e-4 + + (0.275573350756016588011357e-6 + + (0.208744349831471353536305e-8 + + 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2); + + /* At this point sinh(dy) is approximated by dy + sdy. + Shift some significant bits from dy to sdy. */ + + GET_BITS_DP64(dy, ux); + ux &= 0xfffffffff8000000; + PUT_BITS_DP64(ux, sdy1); + sdy2 = sdy + (dy - sdy1); + + z = ((((((cosh_tail[ind]*sdy2 + sinh_tail[ind]*cdy) + + cosh_tail[ind]*sdy1) + sinh_tail[ind]) + + cosh_lead[ind]*sdy2) + sinh_lead[ind]*cdy) + + cosh_lead[ind]*sdy1) + sinh_lead[ind]; + } + + if (xneg) z = - z; + return z; +} + +weak_alias (__sinh, sinh)
diff --git a/src/sinhf.c b/src/sinhf.c new file mode 100644 index 0000000..eaad0fd --- /dev/null +++ b/src/sinhf.c
@@ -0,0 +1,292 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_SPLITEXP +#define USE_SCALEDOUBLE_1 +#define USE_INFINITY_WITH_FLAGS +#define USE_VALF_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_SPLITEXP +#undef USE_SCALEDOUBLE_1 +#undef USE_INFINITY_WITH_FLAGS +#undef USE_VALF_WITH_FLAGS +#undef USE_HANDLE_ERRORF + +#include "../inc/libm_errno_amd.h" + +#ifndef WINDOWS +/* Deal with errno for out-of-range result */ +static inline float retval_errno_erange(float x, int xneg) +{ + struct exception exc; + exc.arg1 = (double)x; + exc.arg2 = (double)x; + exc.type = OVERFLOW; + exc.name = (char *)"sinhf"; + if (_LIB_VERSION == _SVID_) + { + if (xneg) + exc.retval = -HUGE; + else + exc.retval = HUGE; + } + else + { + if (xneg) + exc.retval = -infinity_with_flags(AMD_F_OVERFLOW); + else + exc.retval = infinity_with_flags(AMD_F_OVERFLOW); + } + if (_LIB_VERSION == _POSIX_) + __set_errno(ERANGE); + else if (!matherr(&exc)) + __set_errno(ERANGE); + return exc.retval; +} +#endif + +#ifdef WINDOWS +#pragma function(sinhf) +#endif + +float FN_PROTOTYPE(sinhf)(float fx) +{ + /* + After dealing with special cases the computation is split into + regions as follows: + + abs(x) >= max_sinh_arg: + sinh(x) = sign(x)*Inf + + abs(x) >= small_threshold: + sinh(x) = sign(x)*exp(abs(x))/2 computed using the + splitexp and scaleDouble functions as for exp_amd(). + + abs(x) < small_threshold: + compute p = exp(y) - 1 and then z = 0.5*(p+(p/(p+1.0))) + sinh(x) is then sign(x)*z. */ + + static const double + /* The max argument of sinhf, but stored as a double */ + max_sinh_arg = 8.94159862922329438106e+01, /* 0x40565a9f84f82e63 */ + thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */ + log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */ + log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */ + small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889; + /* (8*BASEDIGITS_DP64*log10of2) ' exp(-x) insignificant c.f. exp(x) */ + + /* Tabulated values of sinh(i) and cosh(i) for i = 0,...,36. 
*/ + + static const double sinh_lead[37] = { + 0.00000000000000000000e+00, /* 0x0000000000000000 */ + 1.17520119364380137839e+00, /* 0x3ff2cd9fc44eb982 */ + 3.62686040784701857476e+00, /* 0x400d03cf63b6e19f */ + 1.00178749274099008204e+01, /* 0x40240926e70949ad */ + 2.72899171971277496596e+01, /* 0x403b4a3803703630 */ + 7.42032105777887522891e+01, /* 0x40528d0166f07374 */ + 2.01713157370279219549e+02, /* 0x406936d22f67c805 */ + 5.48316123273246489589e+02, /* 0x408122876ba380c9 */ + 1.49047882578955000099e+03, /* 0x409749ea514eca65 */ + 4.05154190208278987484e+03, /* 0x40afa7157430966f */ + 1.10132328747033916443e+04, /* 0x40c5829dced69991 */ + 2.99370708492480553105e+04, /* 0x40dd3c4488cb48d6 */ + 8.13773957064298447222e+04, /* 0x40f3de1654d043f0 */ + 2.21206696003330085659e+05, /* 0x410b00b5916a31a5 */ + 6.01302142081972560845e+05, /* 0x412259ac48bef7e3 */ + 1.63450868623590236530e+06, /* 0x4138f0ccafad27f6 */ + 4.44305526025387924165e+06, /* 0x4150f2ebd0a7ffe3 */ + 1.20774763767876271158e+07, /* 0x416709348c0ea4ed */ + 3.28299845686652474105e+07, /* 0x417f4f22091940bb */ + 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */ + 2.42582597704895108938e+08, /* 0x41aceb088b68e803 */ + 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */ + 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */ + 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */ + 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */ + 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */ + 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */ + 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */ + 7.23128532145737548828e+11, /* 0x42650bba3796379a */ + 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */ + 5.34323729076223046875e+12, /* 0x429370470aec28ec */ + 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */ + 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */ + 1.07321789892958031250e+14, /* 0x42d866f34a725782 */ + 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */ + 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */ + 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */ + + static const double cosh_lead[37] = { + 1.00000000000000000000e+00, /* 0x3ff0000000000000 */ + 1.54308063481524371241e+00, /* 0x3ff8b07551d9f550 */ + 3.76219569108363138810e+00, /* 0x400e18fa0df2d9bc */ + 1.00676619957777653269e+01, /* 0x402422a497d6185e */ + 2.73082328360164865444e+01, /* 0x403b4ee858de3e80 */ + 7.42099485247878334349e+01, /* 0x40528d6fcbeff3a9 */ + 2.01715636122455890700e+02, /* 0x406936e67db9b919 */ + 5.48317035155212010977e+02, /* 0x4081228949ba3a8b */ + 1.49047916125217807348e+03, /* 0x409749eaa93f4e76 */ + 4.05154202549259389343e+03, /* 0x40afa715845d8894 */ + 1.10132329201033226127e+04, /* 0x40c5829dd053712d */ + 2.99370708659497577173e+04, /* 0x40dd3c4489115627 */ + 8.13773957125740562333e+04, /* 0x40f3de1654d6b543 */ + 2.21206696005590405548e+05, /* 0x410b00b5916b6105 */ + 6.01302142082804115489e+05, /* 0x412259ac48bf13ca */ + 1.63450868623620807193e+06, /* 0x4138f0ccafad2d17 */ + 4.44305526025399193168e+06, /* 0x4150f2ebd0a8005c */ + 1.20774763767876680940e+07, /* 0x416709348c0ea503 */ + 3.28299845686652623117e+07, /* 0x417f4f22091940bf */ + 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */ + 2.42582597704895138741e+08, /* 0x41aceb088b68e804 */ + 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */ + 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */ + 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */ + 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */ + 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */ + 
9.78648047144193725586e+10, /* 0x4236c932696a6b5c */ + 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */ + 7.23128532145737548828e+11, /* 0x42650bba3796379a */ + 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */ + 5.34323729076223046875e+12, /* 0x429370470aec28ec */ + 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */ + 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */ + 1.07321789892958031250e+14, /* 0x42d866f34a725782 */ + 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */ + 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */ + 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */ + + unsigned long long ux, aux, xneg; + double x = fx, y, z, z1, z2; + int m; + + /* Special cases */ + + GET_BITS_DP64(x, ux); + aux = ux & ~SIGNBIT_DP64; + if (aux < 0x3f10000000000000) /* |x| small enough that sinh(x) = x */ + { + if (aux == 0) + /* with no inexact */ + return fx; + else + return valf_with_flags(fx, AMD_F_INEXACT); + } + else if (aux >= 0x7ff0000000000000) /* |x| is NaN or Inf */ + { +#ifdef WINDOWS + if (aux > 0x7ff0000000000000) + { + /* x is NaN */ + unsigned int uhx; + GET_BITS_SP32(fx, uhx); + return handle_errorf("sinhf", uhx|0x00400000, _DOMAIN, + AMD_F_INVALID, EDOM, fx, 0.0F); + } + else +#endif + return fx + fx; + } + + xneg = (aux != ux); + + y = x; + if (xneg) y = -x; + + if (y >= max_sinh_arg) + { + /* Return infinity with overflow flag. */ +#ifdef WINDOWS + if (xneg) + return handle_errorf("sinhf", NINFBITPATT_SP32, _OVERFLOW, + AMD_F_OVERFLOW, ERANGE, fx, 0.0F); + else + return handle_errorf("sinhf", PINFBITPATT_SP32, _OVERFLOW, + AMD_F_OVERFLOW, ERANGE, fx, 0.0F); +#else + /* This handles POSIX behaviour */ + __set_errno(ERANGE); + z = infinity_with_flags(AMD_F_OVERFLOW); +#endif + } + else if (y >= small_threshold) + { + /* In this range y is large enough so that + the negative exponential is negligible, + so sinh(y) is approximated by sign(x)*exp(y)/2. The + code below is an inlined version of that from + exp() with two changes (it operates on + y instead of x, and the division by 2 is + done by reducing m by 1). */ + + splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead, + log2_by_32_tail, &m, &z1, &z2); + m -= 1; + /* scaleDouble_1 is always safe because the argument x was + float, rather than double */ + z = scaleDouble_1((z1+z2),m); + } + else + { + /* In this range we find the integer part y0 of y + and the increment dy = y - y0. We then compute + + z = sinh(y) = sinh(y0)cosh(dy) + cosh(y0)sinh(dy) + + where sinh(y0) and cosh(y0) are tabulated above. */ + + int ind; + double dy, dy2, sdy, cdy; + + ind = (int)y; + dy = y - ind; + + dy2 = dy*dy; + + sdy = dy + dy*dy2*(0.166666666666666667013899e0 + + (0.833333333333329931873097e-2 + + (0.198412698413242405162014e-3 + + (0.275573191913636406057211e-5 + + (0.250521176994133472333666e-7 + + (0.160576793121939886190847e-9 + + 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2); + + cdy = 1 + dy2*(0.500000000000000005911074e0 + + (0.416666666666660876512776e-1 + + (0.138888888889814854814536e-2 + + (0.248015872460622433115785e-4 + + (0.275573350756016588011357e-6 + + (0.208744349831471353536305e-8 + + 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2); + + z = sinh_lead[ind]*cdy + cosh_lead[ind]*sdy; + } + + if (xneg) z = - z; + return (float)z; +} + +weak_alias (__sinhf, sinhf)
diff --git a/src/sqrt.c b/src/sqrt.c new file mode 100644 index 0000000..14c5b1e --- /dev/null +++ b/src/sqrt.c
@@ -0,0 +1,65 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include <emmintrin.h>
+#include <math.h>
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+#pragma function(sqrt)
+#endif
+/* SSE2 provides the instruction SQRTSD, which computes the square root
+   of the low-order double-precision floating-point value in an XMM register
+   or in a 64-bit memory location and writes the result to the low-order
+   quadword of another XMM register. The corresponding intrinsic is
+   _mm_sqrt_sd(). */
+double FN_PROTOTYPE(sqrt)(double x)
+{
+    __m128d X128;
+    double result;
+    UT64 uresult;
+
+    if (x < 0.0)
+    {
+        uresult.u64 = 0xfff8000000000000;
+        __amd_handle_error(DOMAIN, EDOM, "sqrt", x, 0.0, uresult.f64);
+        return uresult.f64;
+    }
+    /* Load x into an XMM register */
+    X128 = _mm_load_sd(&x);
+    /* Calculate the square root using the SQRTSD instruction */
+    X128 = _mm_sqrt_sd(X128, X128);
+    /* Store the result back into a double-precision floating-point number */
+    _mm_store_sd(&result, X128);
+    return result;
+}
+
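The SQRTSD fast path is easy to exercise outside the library. A minimal sketch with the negative-argument error handling stripped (requires SSE2; on GCC/Clang targets where it is not the default, compile with -msse2):

#include <stdio.h>
#include <emmintrin.h>

/* Square root via the SQRTSD instruction, as in sqrt above. */
static double sse2_sqrt(double x)
{
    __m128d v = _mm_load_sd(&x); /* x in the low lane, high lane zero */
    double  r;
    _mm_store_sd(&r, _mm_sqrt_sd(v, v));
    return r;
}

int main(void)
{
    printf("%.17g\n", sse2_sqrt(2.0)); /* 1.4142135623730951 */
    return 0;
}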
diff --git a/src/sqrtf.c b/src/sqrtf.c new file mode 100644 index 0000000..48e53cd --- /dev/null +++ b/src/sqrtf.c
@@ -0,0 +1,73 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include <emmintrin.h>
+#include <math.h>
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+#pragma function(sqrtf)
+#endif
+/* SSE2 provides the instruction SQRTSS, which computes the square root
+   of the low-order single-precision floating-point value in an XMM register
+   or in a 32-bit memory location and writes the result to the low-order
+   doubleword of another XMM register. The corresponding intrinsic is
+   _mm_sqrt_ss(). */
+float FN_PROTOTYPE(sqrtf)(float x)
+{
+    __m128 X128;
+    float result;
+    UT32 uresult;
+
+    if (x < 0.0)
+    {
+        uresult.u32 = 0xffc00000;
+
+        {
+            unsigned int is_x_snan;
+            UT32 xm; xm.f32 = x;
+            is_x_snan = (((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0);
+            __amd_handle_errorf(DOMAIN, EDOM, "sqrtf", x, is_x_snan, 0.0f, 0, uresult.f32, 0);
+        }
+
+        return uresult.f32;
+    }
+
+    /* Load x into an XMM register */
+    X128 = _mm_load_ss(&x);
+    /* Calculate the square root using the SQRTSS instruction */
+    X128 = _mm_sqrt_ss(X128);
+    /* Store the result back into a single-precision floating-point number */
+    _mm_store_ss(&result, X128);
+    return result;
+}
+
diff --git a/src/tan.c b/src/tan.c new file mode 100644 index 0000000..a7fe651 --- /dev/null +++ b/src/tan.c
@@ -0,0 +1,260 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + +#define USE_NAN_WITH_FLAGS +#define USE_VAL_WITH_FLAGS +#define USE_HANDLE_ERROR +#include "../inc/libm_inlines_amd.h" +#undef USE_NAN_WITH_FLAGS +#undef USE_VAL_WITH_FLAGS +#undef USE_HANDLE_ERROR + +#ifdef WINDOWS +#include "../inc/libm_errno_amd.h" +#endif + +extern void __amd_remainder_piby2(double x, double *r, double *rr, int *region); + +/* tan(x + xx) approximation valid on the interval [-pi/4,pi/4]. + If recip is true return -1/tan(x + xx) instead. */ +static inline double tan_piby4(double x, double xx, int recip) +{ + double r, t1, t2, xl; + int transform = 0; + static const double + piby4_lead = 7.85398163397448278999e-01, /* 0x3fe921fb54442d18 */ + piby4_tail = 3.06161699786838240164e-17; /* 0x3c81a62633145c06 */ + + /* In order to maintain relative precision transform using the identity: + tan(pi/4-x) = (1-tan(x))/(1+tan(x)) for arguments close to pi/4. + Similarly use tan(x-pi/4) = (tan(x)-1)/(tan(x)+1) close to -pi/4. */ + + if (x > 0.68) + { + transform = 1; + x = piby4_lead - x; + xl = piby4_tail - xx; + x += xl; + xx = 0.0; + } + else if (x < -0.68) + { + transform = -1; + x = piby4_lead + x; + xl = piby4_tail + xx; + x += xl; + xx = 0.0; + } + + /* Core Remez [2,3] approximation to tan(x+xx) on the + interval [0,0.68]. */ + + r = x*x + 2.0 * x * xx; + t1 = x; + t2 = xx + x*r* + (0.372379159759792203640806338901e0 + + (-0.229345080057565662883358588111e-1 + + 0.224044448537022097264602535574e-3*r)*r)/ + (0.111713747927937668539901657944e1 + + (-0.515658515729031149329237816945e0 + + (0.260656620398645407524064091208e-1 - + 0.232371494088563558304549252913e-3*r)*r)*r); + + /* Reconstruct tan(x) in the transformed case. 
*/ + + if (transform) + { + double t; + t = t1 + t2; + if (recip) + return transform*(2*t/(t-1) - 1.0); + else + return transform*(1.0 - 2*t/(1+t)); + } + + if (recip) + { + /* Compute -1.0/(t1 + t2) accurately */ + double trec, trec_top, z1, z2, t; + unsigned long long u; + t = t1 + t2; + GET_BITS_DP64(t, u); + u &= 0xffffffff00000000; + PUT_BITS_DP64(u, z1); + z2 = t2 - (z1 - t1); + trec = -1.0 / t; + GET_BITS_DP64(trec, u); + u &= 0xffffffff00000000; + PUT_BITS_DP64(u, trec_top); + return trec_top + trec * ((1.0 + trec_top * z1) + trec_top * z2); + + } + else + return t1 + t2; +} + +#ifdef WINDOWS +#pragma function(tan) +#endif + +double FN_PROTOTYPE(tan)(double x) +{ + double r, rr; + int region, xneg; + + unsigned long long ux, ax; + GET_BITS_DP64(x, ux); + ax = (ux & ~SIGNBIT_DP64); + if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */ + { + if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */ + { + if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */ + { + if (ax == 0x0000000000000000) return x; + else return val_with_flags(x, AMD_F_INEXACT); + } + else + { +#ifdef WINDOWS + /* Using a temporary variable prevents 64-bit VC++ from + rearranging + x + x*x*x*0.333333333333333333; + into + x * (1 + x*x*0.333333333333333333); + The latter results in an incorrectly rounded answer. */ + double tmp; + tmp = x*x*x*0.333333333333333333; + return x + tmp; +#else + return x + x*x*x*0.333333333333333333; +#endif + } + } + else + return tan_piby4(x, 0.0, 0); + } + else if ((ux & EXPBITS_DP64) == EXPBITS_DP64) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_DP64) + /* x is NaN */ +#ifdef WINDOWS + return handle_error("tan", ux|0x0008000000000000, _DOMAIN, 0, + EDOM, x, 0.0); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + else + /* x is infinity. Return a NaN */ +#ifdef WINDOWS + return handle_error("tan", INDEFBITPATT_DP64, _DOMAIN, 0, + EDOM, x, 0.0); +#else + return nan_with_flags(AMD_F_INVALID); +#endif + } + xneg = (ax != ux); + + + if (xneg) + x = -x; + + if (x < 5.0e5) + { + /* For these size arguments we can just carefully subtract the + appropriate multiple of pi/2, using extra precision where + x is close to an exact multiple of pi/2 */ + static const double + twobypi = 6.36619772367581382433e-01, /* 0x3fe45f306dc9c883 */ + piby2_1 = 1.57079632673412561417e+00, /* 0x3ff921fb54400000 */ + piby2_1tail = 6.07710050650619224932e-11, /* 0x3dd0b4611a626331 */ + piby2_2 = 6.07710050630396597660e-11, /* 0x3dd0b4611a600000 */ + piby2_2tail = 2.02226624879595063154e-21, /* 0x3ba3198a2e037073 */ + piby2_3 = 2.02226624871116645580e-21, /* 0x3ba3198a2e000000 */ + piby2_3tail = 8.47842766036889956997e-32; /* 0x397b839a252049c1 */ + double t, rhead, rtail; + int npi2; + unsigned long long uy, xexp, expdiff; + xexp = ax >> EXPSHIFTBITS_DP64; + /* How many pi/2 is x a multiple of? 
*/ + if (ax <= 0x400f6a7a2955385e) /* 5pi/4 */ + { + if (ax <= 0x4002d97c7f3321d2) /* 3pi/4 */ + npi2 = 1; + else + npi2 = 2; + } + else if (ax <= 0x401c463abeccb2bb) /* 9pi/4 */ + { + if (ax <= 0x4015fdbbe9bba775) /* 7pi/4 */ + npi2 = 3; + else + npi2 = 4; + } + else + npi2 = (int)(x * twobypi + 0.5); + /* Subtract the multiple from x to get an extra-precision remainder */ + rhead = x - npi2 * piby2_1; + rtail = npi2 * piby2_1tail; + GET_BITS_DP64(rhead, uy); + expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64); + if (expdiff > 15) + { + /* The remainder is pretty small compared with x, which + implies that x is a near multiple of pi/2 + (x matches the multiple to at least 15 bits) */ + t = rhead; + rtail = npi2 * piby2_2; + rhead = t - rtail; + rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + if (expdiff > 48) + { + /* x matches a pi/2 multiple to at least 48 bits */ + t = rhead; + rtail = npi2 * piby2_3; + rhead = t - rtail; + rtail = npi2 * piby2_3tail - ((t - rhead) - rtail); + } + } + r = rhead - rtail; + rr = (rhead - r) - rtail; + region = npi2 & 3; + } + else + { + /* Reduce x into range [-pi/4,pi/4] */ + __amd_remainder_piby2(x, &r, &rr, ®ion); + /* __remainder_piby2(x, &r, &rr, ®ion);*/ + } + + if (xneg) + return -tan_piby4(r, rr, region & 1); + else + return tan_piby4(r, rr, region & 1); +} + +weak_alias (__tan, tan)
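The medium-argument path above is a Cody-Waite reduction: pi/2 is split into a short head whose product with npi2 is exact, plus progressively smaller tails applied afterwards, so the subtraction x - npi2*pi/2 cancels without rounding error even when x sits close to a multiple of pi/2. A compact illustration using the first split pair copied from the code:

#include <stdio.h>

int main(void)
{
    /* pi/2 split into a ~33-bit head and a tail (values from tan above).
     * The head has few enough significant bits that npi2 * piby2_1 is
     * exact for every npi2 this path can produce, so the subtraction
     * below cancels exactly; the tail is then applied to the small
     * remainder. */
    const double piby2_1     = 1.57079632673412561417e+00;
    const double piby2_1tail = 6.07710050650619224932e-11;
    const double twobypi     = 6.36619772367581382433e-01;

    double x   = 100.0;
    int   npi2 = (int)(x * twobypi + 0.5);     /* 64 */

    double rhead = x - npi2 * piby2_1;         /* exact cancellation */
    double r     = rhead - npi2 * piby2_1tail; /* apply the tail     */

    printf("npi2=%d r=%.17g\n", npi2, r);      /* r ~ -0.53096491487 */
    return 0;
}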
diff --git a/src/tanf.c b/src/tanf.c new file mode 100644 index 0000000..856cdcf --- /dev/null +++ b/src/tanf.c
@@ -0,0 +1,203 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + + +/*#define USE_REMAINDER_PIBY2F_INLINE*/ +#define USE_VALF_WITH_FLAGS +#define USE_NANF_WITH_FLAGS +#define USE_HANDLE_ERRORF +#include "../inc/libm_inlines_amd.h" +#undef USE_VALF_WITH_FLAGS +#undef USE_NANF_WITH_FLAGS +/*#undef USE_REMAINDER_PIBY2F_INLINE*/ +#undef USE_HANDLE_ERRORF + +#ifdef WINDOWS +#include "../inc/libm_errno_amd.h" +#endif + +extern void __amd_remainder_piby2d2f(unsigned long long ux, double *r, int *region); + +/* tan(x) approximation valid on the interval [-pi/4,pi/4]. + If recip is true return -1/tan(x) instead. */ +static inline double tanf_piby4(double x, int recip) +{ + double r, t; + + /* Core Remez [1,2] approximation to tan(x) on the + interval [0,pi/4]. */ + r = x*x; + t = x + x*r* + (0.385296071263995406715129e0 - + 0.172032480471481694693109e-1 * r) / + (0.115588821434688393452299e+1 + + (-0.51396505478854532132342e0 + + 0.1844239256901656082986661e-1 * r) * r); + + if (recip) + return -1.0 / t; + else + return t; +} + +#ifdef WINDOWS +#pragma function(tanf) +#endif + +float FN_PROTOTYPE(tanf)(float x) +{ + double r, dx; + int region, xneg; + + unsigned long long ux, ax; + + dx = x; + + GET_BITS_DP64(dx, ux); + ax = (ux & ~SIGNBIT_DP64); + + if (ax <= 0x3fe921fb54442d18LL) /* abs(x) <= pi/4 */ + { + if (ax < 0x3f80000000000000LL) /* abs(x) < 2.0^(-7) */ + { + if (ax < 0x3f20000000000000LL) /* abs(x) < 2.0^(-13) */ + { + if (ax == 0x0000000000000000LL) + return x; + else + return valf_with_flags(x, AMD_F_INEXACT); + } + else + return (float)(dx + dx*dx*dx*0.333333333333333333); + } + else + return (float)tanf_piby4(x, 0); + } + else if ((ux & EXPBITS_DP64) == EXPBITS_DP64) + { + /* x is either NaN or infinity */ + if (ux & MANTBITS_DP64) + { + /* x is NaN */ +#ifdef WINDOWS + unsigned int ufx; + GET_BITS_SP32(x, ufx); + return handle_errorf("tanf", ufx|0x00400000, _DOMAIN, 0, + EDOM, x, 0.0F); +#else + return x + x; /* Raise invalid if it is a signalling NaN */ +#endif + } + else + { + /* x is infinity. 
Return a NaN */ +#ifdef WINDOWS + return handle_errorf("tanf", INDEFBITPATT_SP32, _DOMAIN, 0, + EDOM, x, 0.0F); +#else + return nanf_with_flags(AMD_F_INVALID); +#endif + } + } + + xneg = (int)(ux >> 63); + + if (xneg) + dx = -dx; + + if (dx < 5.0e5) + { + /* For these size arguments we can just carefully subtract the + appropriate multiple of pi/2, using extra precision where + dx is close to an exact multiple of pi/2 */ + static const double + twobypi = 6.36619772367581382433e-01, /* 0x3fe45f306dc9c883 */ + piby2_1 = 1.57079632673412561417e+00, /* 0x3ff921fb54400000 */ + piby2_1tail = 6.07710050650619224932e-11, /* 0x3dd0b4611a626331 */ + piby2_2 = 6.07710050630396597660e-11, /* 0x3dd0b4611a600000 */ + piby2_2tail = 2.02226624879595063154e-21, /* 0x3ba3198a2e037073 */ + piby2_3 = 2.02226624871116645580e-21, /* 0x3ba3198a2e000000 */ + piby2_3tail = 8.47842766036889956997e-32; /* 0x397b839a252049c1 */ + double t, rhead, rtail; + int npi2; + unsigned long long uy, xexp, expdiff; + xexp = ax >> EXPSHIFTBITS_DP64; + /* How many pi/2 is dx a multiple of? */ + if (ax <= 0x400f6a7a2955385eLL) /* 5pi/4 */ + { + if (ax <= 0x4002d97c7f3321d2LL) /* 3pi/4 */ + npi2 = 1; + else + npi2 = 2; + } + else if (ax <= 0x401c463abeccb2bbLL) /* 9pi/4 */ + { + if (ax <= 0x4015fdbbe9bba775LL) /* 7pi/4 */ + npi2 = 3; + else + npi2 = 4; + } + else + npi2 = (int)(dx * twobypi + 0.5); + /* Subtract the multiple from dx to get an extra-precision remainder */ + rhead = dx - npi2 * piby2_1; + rtail = npi2 * piby2_1tail; + GET_BITS_DP64(rhead, uy); + expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64); + if (expdiff > 15) + { + /* The remainder is pretty small compared with dx, which + implies that dx is a near multiple of pi/2 + (dx matches the multiple to at least 15 bits) */ + t = rhead; + rtail = npi2 * piby2_2; + rhead = t - rtail; + rtail = npi2 * piby2_2tail - ((t - rhead) - rtail); + if (expdiff > 48) + { + /* dx matches a pi/2 multiple to at least 48 bits */ + t = rhead; + rtail = npi2 * piby2_3; + rhead = t - rtail; + rtail = npi2 * piby2_3tail - ((t - rhead) - rtail); + } + } + r = rhead - rtail; + region = npi2 & 3; + } + else + { + /* Reduce x into range [-pi/4,pi/4] */ + __amd_remainder_piby2d2f(ax, &r, ®ion); + } + + if (xneg) + return (float)-tanf_piby4(r, region & 1); + else + return (float)tanf_piby4(r, region & 1); +} + +weak_alias (__tanf, tanf)
diff --git a/src/tanh.c b/src/tanh.c new file mode 100644 index 0000000..ead758b --- /dev/null +++ b/src/tanh.c
@@ -0,0 +1,129 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + + +#define USE_SPLITEXP +#define USE_SCALEDOUBLE_2 +#define USE_VAL_WITH_FLAGS +#include "../inc/libm_inlines_amd.h" +#undef USE_SPLITEXP +#undef USE_SCALEDOUBLE_2 +#undef USE_VAL_WITH_FLAGS + +double FN_PROTOTYPE(tanh)(double x) +{ + /* + The definition of tanh(x) is sinh(x)/cosh(x), which is also equivalent + to the following three formulae: + 1. (exp(x) - exp(-x))/(exp(x) + exp(-x)) + 2. (1 - (2/(exp(2*x) + 1 ))) + 3. (exp(2*x) - 1)/(exp(2*x) + 1) + but computationally, some formulae are better on some ranges. + */ + static const double + thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */ + log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */ + log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */ + large_threshold = 20.0; /* 0x4034000000000000 */ + + unsigned long long ux, aux, xneg; + double y, z, p, z1, z2; + int m; + + /* Special cases */ + + GET_BITS_DP64(x, ux); + aux = ux & ~SIGNBIT_DP64; + if (aux < 0x3e30000000000000) /* |x| small enough that tanh(x) = x */ + { + if (aux == 0) + return x; /* with no inexact */ + else + return val_with_flags(x, AMD_F_INEXACT); + } + else if (aux > 0x7ff0000000000000) /* |x| is NaN */ + return x + x; + + xneg = (aux != ux); + + y = x; + if (xneg) y = -x; + + if (y > large_threshold) + { + /* If x is large then exp(-x) is negligible and + formula 1 reduces to plus or minus 1.0 */ + z = 1.0; + } + else if (y <= 1.0) + { + double y2; + y2 = y*y; + if (y < 0.9) + { + /* Use a [3,3] Remez approximation on [0,0.9]. */ + z = y + y*y2* + (-0.274030424656179760118928e0 + + (-0.176016349003044679402273e-1 + + (-0.200047621071909498730453e-3 - + 0.142077926378834722618091e-7*y2)*y2)*y2)/ + (0.822091273968539282568011e0 + + (0.381641414288328849317962e0 + + (0.201562166026937652780575e-1 + + 0.2091140262529164482568557e-3*y2)*y2)*y2); + } + else + { + /* Use a [3,3] Remez approximation on [0.9,1]. */ + z = y + y*y2* + (-0.227793870659088295252442e0 + + (-0.146173047288731678404066e-1 + + (-0.165597043903549960486816e-3 - + 0.115475878996143396378318e-7*y2)*y2)*y2)/ + (0.683381611977295894959554e0 + + (0.317204558977294374244770e0 + + (0.167358775461896562588695e-1 + + 0.173076050126225961768710e-3*y2)*y2)*y2); + } + } + else + { + /* Compute p = exp(2*y) + 1. The code is basically inlined + from exp_amd. */ + + splitexp(2*y, 1.0, thirtytwo_by_log2, log2_by_32_lead, + log2_by_32_tail, &m, &z1, &z2); + p = scaleDouble_2(z1 + z2, m) + 1.0; + + /* Now reconstruct tanh from p. */ + z = (1.0 - 2.0/p); + } + + if (xneg) z = - z; + return z; +} + +weak_alias (__tanh, tanh)
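Of the three formulae listed at the top of tanh, the large-argument branch uses formula 2, z = 1 - 2/(exp(2x) + 1), because its denominator exp(2x) + 1 never suffers cancellation for positive x. A quick check that it agrees with libm's tanh (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 1.25;
    double p = exp(2.0 * x) + 1.0; /* formula 2 denominator */
    printf("%.17g\n%.17g\n", 1.0 - 2.0 / p, tanh(x)); /* agree */
    return 0;
}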
diff --git a/src/tanhf.c b/src/tanhf.c new file mode 100644 index 0000000..1cb14c4 --- /dev/null +++ b/src/tanhf.c
@@ -0,0 +1,126 @@ + +/* +* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved. +* +* This file is part of libacml_mv. +* +* libacml_mv is free software; you can redistribute it and/or +* modify it under the terms of the GNU Lesser General Public +* License as published by the Free Software Foundation; either +* version 2.1 of the License, or (at your option) any later version. +* +* libacml_mv is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +* Lesser General Public License for more details. +* +* You should have received a copy of the GNU Lesser General Public +* License along with libacml_mv. If not, see +* <http://www.gnu.org/licenses/>. +* +*/ + + + +#include "../inc/libm_amd.h" +#include "../inc/libm_util_amd.h" + + + +#define USE_SPLITEXPF +#define USE_SCALEFLOAT_2 +#define USE_VALF_WITH_FLAGS +#include "../inc/libm_inlines_amd.h" +#undef USE_SPLITEXPF +#undef USE_SCALEFLOAT_2 +#undef USE_VALF_WITH_FLAGS + +#include "../inc/libm_errno_amd.h" + +float FN_PROTOTYPE(tanhf)(float x) +{ + /* + The definition of tanh(x) is sinh(x)/cosh(x), which is also equivalent + to the following three formulae: + 1. (exp(x) - exp(-x))/(exp(x) + exp(-x)) + 2. (1 - (2/(exp(2*x) + 1 ))) + 3. (exp(2*x) - 1)/(exp(2*x) + 1) + but computationally, some formulae are better on some ranges. + */ + static const float + thirtytwo_by_log2 = 4.6166240692e+01F, /* 0x4238aa3b */ + log2_by_32_lead = 2.1659851074e-02F, /* 0x3cb17000 */ + log2_by_32_tail = 9.9831822808e-07F, /* 0x3585fdf4 */ + large_threshold = 10.0F; /* 0x41200000 */ + + unsigned int ux, aux; + float y, z, p, z1, z2, xneg; + int m; + + /* Special cases */ + + GET_BITS_SP32(x, ux); + aux = ux & ~SIGNBIT_SP32; + if (aux < 0x39000000) /* |x| small enough that tanh(x) = x */ + { + if (aux == 0) + return x; /* with no inexact */ + else + return valf_with_flags(x, AMD_F_INEXACT); + } + else if (aux > 0x7f800000) /* |x| is NaN */ + return x + x; + + xneg = 1.0F - 2.0F * (aux != ux); + + y = xneg * x; + + if (y > large_threshold) + { + /* If x is large then exp(-x) is negligible and + formula 1 reduces to plus or minus 1.0 */ + z = 1.0F; + } + else if (y <= 1.0F) + { + float y2; + y2 = y*y; + + if (y < 0.9F) + { + /* Use a [2,1] Remez approximation on [0,0.9]. */ + z = y + y*y2* + (-0.28192806108402678e0F + + (-0.14628356048797849e-2F + + 0.4891631088530669873e-4F*y2)*y2)/ + (0.845784192581041099e0F + + 0.3427017942262751343e0F*y2); + } + else + { + /* Use a [2,1] Remez approximation on [0.9,1]. */ + z = y + y*y2* + (-0.24069858695196524e0F + + (-0.12325644183611929e-2F + + 0.3827534993599483396e-4F*y2)*y2)/ + (0.72209738473684982e0F + + 0.292529068698052819e0F*y2); + } + } + else + { + /* Compute p = exp(2*y) + 1. The code is basically inlined + from exp_amd. */ + + splitexpf(2*y, 1.0F, thirtytwo_by_log2, log2_by_32_lead, + log2_by_32_tail, &m, &z1, &z2); + p = scaleFloat_2(z1 + z2, m) + 1.0F; + /* Now reconstruct tanh from p. */ + z = (1.0F - 2.0F/p); + } + + return xneg * z; +} + + +weak_alias (__tanhf, tanhf)
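One detail worth noting: unlike the double version above, which branches on xneg to negate the result, tanhf folds the sign back in branchlessly. The comparison (aux != ux) is 1 exactly when the sign bit was set, so xneg = 1.0F - 2.0F * (aux != ux) evaluates to -1.0F for negative inputs and +1.0F otherwise, and the final multiply transfers the sign (including for -0.0F). A minimal standalone sketch of the idiom, with illustrative names that are not from the patch:

/* Sketch: branchless sign extraction as used in tanhf above. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float sign_of(float x)
{
    uint32_t ux;
    memcpy(&ux, &x, sizeof ux);        /* portable stand-in for GET_BITS_SP32 */
    uint32_t aux = ux & 0x7fffffffu;   /* clear the sign bit */
    return 1.0f - 2.0f * (aux != ux);  /* +1.0f or -1.0f, no branch */
}

int main(void)
{
    printf("%+.1f %+.1f %+.1f\n",
           sign_of(3.5f), sign_of(-3.5f), sign_of(-0.0f)); /* +1.0 -1.0 -1.0 */
    return 0;
}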
diff --git a/testdata/exp.rephil_docs.builtin.baseline.trace b/testdata/exp.rephil_docs.builtin.baseline.trace new file mode 100644 index 0000000..8344f12 --- /dev/null +++ b/testdata/exp.rephil_docs.builtin.baseline.trace Binary files differ
diff --git a/testdata/expf.fastmath_unittest.trace b/testdata/expf.fastmath_unittest.trace new file mode 100644 index 0000000..c867b36 --- /dev/null +++ b/testdata/expf.fastmath_unittest.trace Binary files differ
diff --git a/testdata/log.rephil_docs.builtin.baseline.trace b/testdata/log.rephil_docs.builtin.baseline.trace new file mode 100644 index 0000000..e87d631 --- /dev/null +++ b/testdata/log.rephil_docs.builtin.baseline.trace Binary files differ
diff --git a/testdata/notes.txt b/testdata/notes.txt new file mode 100644 index 0000000..8b5884f --- /dev/null +++ b/testdata/notes.txt
@@ -0,0 +1,23 @@ +The traces in this directory are used for validating and testing +the performance of the math library. Each file contains the input +arguments to the specific math functions, written in raw binary +format. + +exp, log, and pow were collected from the Perflab benchmark +compiler/rephil/docs/v7; expf was collected from +util/math:fastmath_unittest. + +The traces were collected by linking in a small library that wrote +the first 4M arguments to a file before returning the actual value. + - The library was added as a dep to "base:base". + - To avoid writing samples for genrules, the profiling was guarded by + a macro that was defined using --copt. + - Tcmalloc holds a lock while it calls log(), so care had to be taken + not to cause a deadlock when profiling log(). + For the other functions, the actual value could be calculated + using something like this: + _exp = (double (*)(double)) dlsym(RTLD_NEXT, "exp"); + return _exp(x); + For log(), we made the following call instead: + return log10(x)/log10(2.71828182846); +
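For reference, the interposition described in the notes amounts to wrapping each libm entry point, logging the raw argument bits, and forwarding to the real implementation via dlsym(RTLD_NEXT, ...). A minimal sketch follows; the file and symbol names are illustrative, and the actual collection library is not part of this patch.

/* Sketch: an interposing exp() that records arguments in raw binary
 * before forwarding to the real libm implementation.
 * Build as a shared object and preload it, e.g.:
 *   cc -shared -fPIC -O2 trace_exp.c -o trace_exp.so -ldl
 *   LD_PRELOAD=./trace_exp.so ./benchmark */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

static double (*real_exp)(double);
static FILE *trace;

double exp(double x)
{
    if (!real_exp) {
        real_exp = (double (*)(double)) dlsym(RTLD_NEXT, "exp");
        trace = fopen("exp.trace", "wb");
    }
    if (trace)
        fwrite(&x, sizeof x, 1, trace);  /* raw binary, as in the traces */
    return real_exp(x);
}

A production version would also cap the number of samples written (the notes mention 4M arguments) and, per the caveat above, avoid this pattern for log(), where calling back into allocating code under tcmalloc's lock can deadlock.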
diff --git a/testdata/pow.rephil_docs.builtin.baseline.trace b/testdata/pow.rephil_docs.builtin.baseline.trace new file mode 100644 index 0000000..b7a9722 --- /dev/null +++ b/testdata/pow.rephil_docs.builtin.baseline.trace Binary files differ